Looking Beyond Sentence-Level Natural Language Inference for Question Answering and Text Summarization

Natural Language Inference (NLI) has garnered significant attention in recent years; however, the promise of applying NLI breakthroughs to other downstream NLP tasks has remained unfulfilled. In this work, we use the multiple-choice reading comprehension (MCRC) and checking factual correctness of textual summarization (CFCS) tasks to investigate potential reasons for this. Our findings show that: (1) the relatively shorter length of premises in traditional NLI datasets is the primary challenge prohibiting usage in downstream applications (which do better with longer contexts); (2) this challenge can be addressed by automatically converting resource-rich reading comprehension datasets into longer-premise NLI datasets; and (3) models trained on the converted, longer-premise datasets outperform those trained using short-premise traditional NLI datasets on downstream tasks primarily due to the difference in premise lengths.


Introduction
Large-scale, open Natural Language Inference (NLI) datasets (Bowman et al., 2015;Williams et al., 2018) have catalyzed the recent development of NLI models that exhibit close to human-level performance. However, the use of these NLI models for other downstream Natural Language Processing (NLP) tasks has met with limited success. Two of the most popular downstream tasks where NLI models' use has been explored are Multiple-choice Question Answering (MCRC) and Checking Factual Correctness of Summaries (CFCS) (Trivedi et al., 2019;Falke et al., 2019;Clark et al., 2018) -both of which can easily be cast into the NLI form, as shown in Figure 1. Looking closely at the composition of these datasets, it is evident that there is a stark difference in the lengths of the contexts/premises when compared to NLI datasets. As seen in Table 1, traditional NLI datasets have much

Passage/Premise/Full text
The first time my father and I ever went fishing became a family legend. … We were hot, sticky, and mad that the fish refused to suck up our night crawlers . … While driving out we saw a truck with a boat trailer and boat that was stuck in the mud. … my dad helped pull the man from the mud. In return, this fellow gave dad some fish … we agreed to take in the fish as if we had caught them. … As we got up to do the dishes, mom cleared her throat. "I just have one question of you two great fishermen, How was it again that you two managed to not only clean your fish, but also freeze them before you got home."

Hypothesis (NLI)
The fishing became a family legend because they make themselves a fool in front of the mother.

Question (QA)
Why did fishing become a family legend?
They make themselves a fool in front of the mother

Answer Summary (Summarization)
The father and son pretended catching a fish which was given to them making a fool in front of the mother. That is how the fishing trip became a family legend. shorter premises than the context texts from these downstream tasks. Prior research has shown that the capabilities required for handling local inference are very different from those required to perform inference over longer forms of text (Cooper et al., 1996;Lai et al., 2017a). In this work, we explore this conflict as a major bottleneck in the utility of NLI models (trained on traditional NLI datasets) for downstream NLP tasks. We compare the usage of long and short-premise NLI datasets on the dowsntream tasks of MCRC and CFCS, which have inherently long contexts.
Such a comparison has not been possible thus far because traditional NLI datasets do not exhibit long premises. We hence look towards recasting other tasks into NLI to generate datasets that can  be used to evaluate our conjecture. The Question-Answering (QA) task can easily be cast into the NLI form, and QA datasets (Rajpurkar et al., 2016;Lai et al., 2017b;Khashabi et al., 2018;Sun et al., 2019;Huang et al., 2019) encompass a variety of semantic phenomena that only occur in longer (con)texts. We leverage the resource-rich MCRC task to generate long-premise NLI datasets for our experiments via an automated conversion strategy. We contrast the zero-shot model performance on the MCRC and CFCS tasks of a model pre-trained on our converted long-premise NLI dataset and a model trained on two short-premise NLI datasets -MNLI and ANLI. We show that the presence of longer premises is the primary factor for better performance on these two tasks. We further discuss other potential confounding factors for this performance difference -such as dataset vocabulary overlap and dataset conversion strategiesand eliminate the possibility of their contribution through targeted experiments.

Related Work
Performance on the NLI task has improved significantly due to the availability of large scale datasets (Bowman et al., 2015;Williams et al., 2018) that can be used to train data-hungry deep learning models (Kapanipathi et al., 2020;Wang and Jiang, 2015), including transformer-based architectures (Devlin et al., 2018). However, there has been very limited success in translating this performance to downstream NLP tasks. Work relevant to the use of these NLI models for downstream tasks can be categorized into two categories: (1) work focusing on using models trained on shortpremise NLI datasets with fixed or learned aggregations over segmented premises to perform a target downstream task with long contexts (Falke et al., 2019;Trivedi et al., 2019); and (2) work addressing the need for task-specific NLI datasets (Kryściński et al., 2019;Demszky et al., 2018;Welleck et al., 2019).
Despite several attempts, efforts to apply models trained on available NLI datasets to downstream NLP tasks such as MCRC and CFCS have had limited success. Trivedi et al. (2019) use hand-crafted rules to first cast MCRC to NLI; and subsequently divide a long passage into smaller sentence-level premises. They use a pre-trained NLI model to evaluate per-sentence relevance scores concerning one particular hypothesis, and then combine the resulting scores using a learned representation aggregation module to assess the answer given the long passage. Falke et al. (2019) apply a similar approach for the CFCS task, and divide both the provided summary as well as the source documents into single-sentence premises and hypotheses. They use a max pooling operation over the entailment scores of all sentence-level premise-hypothesis pairs to obtain the factual correctness score for each provided summary. Both these works note that models trained on sentence-level NLI datasets do not transfer well to the MCRC and CFCS tasks. We argue that this divide and conquer approach is not ideal for the problem, and highlight the need for an NLI dataset with longer premises.
Another line of research focuses on re-casting datasets from other tasks into an NLI form to facilitate the direct use of NLI models on downstream tasks like MCRC and CFCS. Khot et al. (2018) use manual annotation to re-cast SciQ (a QA dataset) to SciTail -an NLI dataset. However, Clark et al. (2018) show that an NLI model trained on Sci-Tail does not perform well on the task of MCRC. Similarly, Kryściński et al. (2019) create an automatically generated training dataset for CFCS. Even though the generated data has relatively long contexts, analysis in Zhang et al. (2020) demonstrated that a model trained on the aforementioned data showed performance improvement only when the token overlap with the source is high. Besides, Demszky et al. (2018) derive an NLI dataset by converting subsets of various QA datasets. They try two approaches for the conversion -rule-based and neural. For the rule-based approach, they extract POS tags from the question-answer pair and apply hand-crafted rules on them to convert the pair to a hypothesis sentence. Their neural approach uses a trained SEQ2SEQ BiLSTM-with-copy model (Gu et al., 2016) to convert each question, answer pair into a hypothesis sentence (the corresponding passage being the premise). While their approach looks promising, they do not show the utility of these converted datasets by training an NLI model on them. Thus, it remains unclear whether the NLI datasets generated by the conversion are beneficial for NLP tasks. We posit that this direction of research is promising and largely unexplored. In our work, we attempt to leverage the abundance of large and diverse MCRC datasets to generate long-premise NLI datasets, and show that such datasets are useful towards addressing downstream NLP tasks such as MCRC and CFCS which have inherently long contexts.

NLI for Downstream Tasks
Typically, NLI is cast as a multi-class classification problem, where given a premise and a hypothesis, the model classifies the relation between them as entails, contradicts, or neutral. For the two downstream tasks under consideration: (1) MCRC: Multiple Choice Reading Comprehension, and (2) CFCS: Checking Factual Correctness of Text-Summarization; differentiating between the neutral and contradicts class is often unnecessary. The task is thus reduced to a two-class problem; where the contradicts and neutral classes are clubbed into a not-entails class.
MCRC can be cast as an NLI task by viewing the given context as the premise and the transformed question-answer combinations as different hypotheses (Trivedi et al., 2019). The multiple answeroption setting can then be approached as: (a) an individual option entailment task, where more than one answer-option can be correct; or (b) a multiclass classification task across all the answer options, when only a single correct answer exists.
CFCS can also be reduced to a two-class NLI problem. A factually correct summary should be entailed by the given source text -it should not contain hallucinated facts, and it should also not contradict facts present in the source text.

The Long Premise Conjecture
Despite being ideally suited for reduction to NLI, both MCRC and CFCS have proved to be difficult to solve using models trained on short-premise NLI datasets (Trivedi et al., 2019;Falke et al., 2019). Datasets for these tasks contain significantly longer contexts than traditional short-premise NLI datasets (Table 1). This shift in the text length brings about a fundamental change in the nature of the NLI problem. Thus, models trained on shortpremise NLI datasets are incapable of performing inference over longer texts, which we posit as the main cause for their poor performance on downstream tasks like CFCS and MCRC * .
The paucity of manually-annotated long-premise NLI datasets poses a barrier to assessing this conjecture. We thus shift our focus towards leveraging the abundance of large and diverse MCRC datasets which can be easily recast into NLI form. While the CFCS task also provides a similar opportunity, the sheer lack of annotated training instances inhibits its use. Table 3 shows the abundance of training instances in MCRC datasets, and highlights the deficiency in CFCS datasets.
In the following section, we present our conversion strategy for reformatting MCRC datasets into long-premise NLI datasets, which are needed to test the long premise conjecture.

Conversion of MCRC to NLI
As shown in Figure 1, we can convert MCRC datasets into two-class NLI datasets by reusing the passage as a premise, and paraphrasing the question along with each answer option as individual hypothesis options.
We begin by using a rule-based conversion method. A dependency parse of both the question and answer option is generated using the Stanford CoreNLP package (Qi et al., 2018). This is followed by the application of conversion rules proposed by Demszky et al. (2018) to generate a hypothesis sentence. However, due to the limited coverage of rules and errors in the dependency parse, some of the generated hypotheses sound unnatural (e.g. the first example in Table 2). In order to generate more natural and diverse hypotheses and to get broader coverage in conversion, we implement a neural conversion strategy. Four people were hurt when overhanging metalwork crashed onto a stage in a Toronto park Saturday afternoon.. "." # # # # "." was the number of peo Four were hurt when overhanging metalwork crashed onto a stage in a Toronto park Saturday afternoon.  (Hermann et al., 2015), which makes the generated sentences coherent; (2) followed by Google's sentence compression dataset (Filippova and Altun, 2013), which limits the generated sequence to a single sentence; and (3) finally the annotated dataset provided by Demszky et al. (2018) which has around 71, 000 question-answer, hypothesis pairs from various QA datasets. Based on manual inspection, we find that the hypotheses generated by this method indeed sound more natural and diverse than the ones produced by the rule-based conversion † . In some cases, however, the generated hypotheses either discard crucial information, or contain hallucinated facts that do not convey the exact information in the source questionanswer pair (Table 2). We thus define a hybrid conversion strategy, combining the desirable aspects of the rule-based and neural conversion strategies.
We design a heuristic to compose a hybrid dataset to overcome the caveats in the neural conversion. We use the number of words in the question-answer concatenation as a proxy for the expected length of the hypothesis. We target the problems of hallucination and missing information in the neural conversions by accepting only those neural-generated hypotheses that lie in the range of 0.8 and 1.2 times the length of the questionanswer concatenation. We replace the rejected neural hypotheses with the rule-based hypothesis, if rule-based conversion is feasible; or with the † More examples of conversion results are presented in Appendix D.
question-answer concatenation otherwise; as seen in Table 2. The selection policy is driven by the need to get more natural and coherent conversions without compromising on the accuracy and preservation of factual information in the question and answer option. The choice of the specific range is purely empirical in nature. We use this hybrid conversion strategy to generate long-premise NLI datasets from MCRC datasets for our experiments and evaluate them in contrast to short-premise NLI datasets.

Experimental Setup
Our experiments involve zero-shot evaluations of pre-trained NLI models on downstream NLP tasks. In this section, we describe the transfer learning setup and the datasets used in our experiments.
5.1 A Transferable NLI model ‡ In order to use a pretrained NLI model for MCRC and CFCS, we need that model to be agnostic to the peculiarities of the downstream task. We use a standard transfer learning setting where the model architecture is divided into two parts: (1) a transferable entailment scorer; and (2) a weight-free comparator on top of the scorer. Each premisehypothesis pair is encoded as a single sequence, and passed through the transferable entailment scorer to produce an entailment score. Depending on the problem setup, the comparator can either be a sigmoid function (for a two-class entailment problem) as shown in Figure 2; or a softmax function (for multiple choice classification) as shown in Figure 3. This segmentation of the model makes it easy to transfer the model weights across different tasks. For the entailment scorer, we use a 2-layer feedforward network on top of the [CLS] token of ‡ Code available here: https://github.com/ nli-for-qa/transformers-nli pre-trained RoBERTa § .
To evaluate the transferability of the entailment model, we perform various zero-shot evaluations. This requires interpreting the entailment scores a bit differently for each task. To transfer the weights from a multiple choice classification model (Figure 3) to a two class entailment model (Figure 2), we copy the weights of the transferable entailment scorer as-is, and calibrate a threshold using a dev set to interpret the outputs from the sigmoid comparator for binary classification. Since the softmax comparator does not need any calibration, the transfer in the other direction, i.e., from a two class entailment model to a multiple choice classification model is more straightforward -we simply copy the weights of the transferable entailment scorer.

Datasets
For our experiments, we use the NLI form of 4 MCRC datasets (created using the conversion method described in Section 4); 2 CFCS datasets; and 2 traditional short-premise NLI datasets. These datasets are described below: consists of tuples of the form article, sentence , where the articles are taken from the CNN/DailyMail corpus, and sentences come from the summaries for these articles generated using several state-of-the-art abstractive summarization models. Ranking Summaries for Correctness (evaluation set) (Falke et al., 2019) consists of articles and a set of summary alternatives for each article, where § The RoBERTa model is pre-trained on the masked language modeling objective as described in Liu et al. (2019). We obtain it from the HuggingFace library (Wolf et al., 2019). ¶ Questions where the answer is "None of the above" are removed from the CosmosQA dataset.   Table 1.

Long-Premise NLI Datasets:
We convert the following MCRC datasets to generate long-premise NLI datasets using the hybrid conversion strategy described in Section D. We refer to these datasets with a subscript converted attached to the source MCRC dataset. As seen from Table 1 and Table 3, RACE is the largest dataset amongst the MCRC datasets, and also has the longest average premise length. In line with this intuition, the model trained on the RACE converted dataset outperforms the converted forms of other MCRC datasets (Appendix B) on all the evaluation tasks. Due to this, in the following section, we only discuss and report results on the RACE converted dataset for brevity and clarity of comparison. Amongst the traditional NLI datasets, we use MNLI and ANLI for a good mix of average premise lengths along with a large number of training samples.

Results and Discussion
Our experiments aim to answer the following questions: (1) Are long premise NLI datasets more use- ful for downstream tasks compared to short premise NLI datasets? (Section 6.1 & 6.2); (2) How much do possible confounding factors affect our empirical evaluations? (Section 6.3). To answer these, we perform zero-shot evaluation on the MCRC and CFCS tasks.
We contrast the performance of NLI models trained on the short-premise NLI datasets (MNLI, ANLI) with one that is trained on a long-premise NLI dataset (RACE converted ). The models trained on short-premise NLI datasets are evaluated in two ways: (1) by treating the entire premise as input; and (2) by segmenting the premise into shorter segments and using a max aggregation over the entailment scores of all the segments (Falke et al., 2019). Since the model architecture remains the same, we use the name of the training dataset to refer to the model trained on it.

Evaluation on MCRC
For evaluating NLI models on the MCRC task, we use the hybrid conversion (Section 4) to create evaluation datasets. The MultiRC dataset contains multiple correct answer options and hence is evaluated with each question-answer option posed as a separate example. DREAM and CosmosQA datasets have only a single correct answer-option (out of 3 answer-options). Hence, for these datasets, a multiclass classification problem is posed as described in Section 3, using the model architecture described in Figure 3.
As seen in Table 4, the model trained on the long-premise RACE converted dataset outperforms the model trained on the short-premise NLI datasets in both regular and segmented forms of evaluation. We assert that this difference in performance can  be attributed to the difference in premise lengths of the datasets. However, we allow for the possibility that using the same conversion strategy for the evaluation datasets could potentially benefit the model trained on RACE converted . We discuss such confounding factors in Section 6.3.2.

Evaluation on CFCS
Evaluations on CFCS are set up in two ways: (1) CFCS as classification: In this form, given a document and a corresponding summary sentence, the model needs to identify if the sentence is factually correct with respect to the document (entailed) or not. In order to perform the classification, we first obtain our entailment scorer by fine-tuning the multiple choice classification model (Figure 3) on the RACE Converted dataset and use the dev set || to calibrate a threshold ** (described in Section 5.1) to obtain the two-class entailment model (Figure 2). || We use the dev and test dataset provided by Kryściński et al. (2019) for this task. ** Balanced accuracy is used to find the best threshold.
(2) CFCS as ranking: Given a source document and a set of five machine generated summaries, the model is required to rank at least one factually correct summary above all incorrect summary alternatives. Note that a variable number of these five machine generated summaries can be factually correct (Falke et al., 2019). However, there is always at least one incorrect summary in this set.    The results of evaluations on the MCRC and CFCS tasks -which inherently contain long contexts -provide strong evidence supporting our long premise conjecture.

Confounding Factors
Natural language experiments are often vulnerable to artifacts that may leak exploitable signals into the training data that the model can fit on. Such extraneous factors, if present, can prevent the empirical isolation of the premise-length as a major factor. We therefore discuss and eliminate the two most obvious potential confounding factors.

Vocabulary Overlap
In the zero-shot evaluation setup, a high vocabulary overlap between the training data and the target data can potentially help a model perform better. To eliminate this confounding factor from our experiments, we calculate the vocabulary overlap of RACE, MNLI and ANLI (training data) with the 3 MCRC datasets (evaluation data). We define overlap as: # words in Vocab(eval. data) Table 8 shows that all the datasets have similar vocabulary overlap with the three MCRC datasets. However, from Table 4, we see that the model trained on RACE converted considerably outperforms the models trained on the short-premise NLI datasets. This indicates that vocabulary overlap is not playing a big role in the model's performance.
To substantiate this claim, we further evaluate the two models on those subsets of the three MCRC datasets that consist only of examples where the vocabulary overlap is high (≥ 0.9). Table 7 shows that the performance of the two models on these high vocabulary overlap subsets is similar to their overall performances on the respective datasets. We can thus conclude that vocabulary overlap is not helping either of the models in terms of predictive performance.

Automated Conversion
We evaluate the models trained on the shortpremise NLI datasets and RACE converted on the converted forms of the MCRC datasets. However, only the model trained on the RACE converted dataset is exposed to the same conversion strategy during training. It is therefore possible that the conversion   To create a setting where the difference is vivid, we design the annotation subsets such that the RACE converted model gives an accuracy of around 50% using the hybrid conversion strategy. The independent manual annotations prevent any exploitable signal from leaking into the training data of the model through the conversion mechanism. We compare the performance of models trained on converted forms of the RACE dataset using both our hybrid strategy as well as manual annotation. † † We manually annotate 100 examples from MultiRC and 50 each from ComsosQA and DREAM. MultiRC is evaluated at an option-level with each question-answer pair considered an individual example. On the other hand, CosmosQA and DREAM are evaluated at a question-level, with each example consisting of three question-answer pairs, and one label corresponding to the correct answer option. Table 9 shows that the RACE converted model performs better on the manually annotated subset; this eliminates the possibility of the conversion mechanism being a confounding factor in our results. † † It is important to note that this setting is solely for the purpose of establishing the role of the hybrid conversion strategy as a potential confounding factor in the performance of the RACEconverted model. The absolute accuracy numbers are not reflective of the model performance on the overall dataset.

MultiRC DREAM CosmosQA
Automatic 52.08 50.00 50.00 Manual 55.21 54.00 72.00 Table 9: Evaluation of the RACE converted model on the manually annotated subset of the MCRC datasets as compared to the same subsets with Hybrid conversion.

Conclusion
The difficulty of transferring entailment (NLI) knowledge to downstream NLP tasks can be largely attributed to the difference in data distributions, specifically the premise lengths. Models trained on short-premise NLI datasets are not very good at performing inference over longer texts, which is a central feature of important downstream tasks such as QA and text summarization. We leverage the abundance of large and diverse MCRC datasets and the ease of conversion from MCRC into the NLI format to automatically and scalably create a long-premise NLI dataset to test this long-premise conjecture. We show that the long-premise nature of the converted dataset indeed helps achieve better performance on the downstream tasks of MCRC and CFCS when compared against models trained on traditional short-premise NLI datasets. We further discuss and eliminate possible confounding factors in our experiments to ensure the validity of our results.
Our work highlights a major shortcoming in popular NLI datasets that limits their usefulness to downstream NLP applications; and emphasizes the need for long-premise NLI datasets. Future work in this direction can take us closer to realizing the full potential of NLI as a fundamental task in natural language understanding.

Ethical Considerations
In this work, we use open source datasets, libraries, and services which are freely available and appropriately cited. We do not release the converted form of the MCRC dataset in respect of existing copyright; however, we provide all the information required to reproduce our experimental setup, datasets, and results in the content of the main paper as well as in the appendix.
All rules used in the conversion strategies (Section 4), as well as the manual annotations performed as part of the confounding factors analysis, were produced solely by the group of authors. Our work did not involve any external human subjects; and did not require institutional review.
Looking forward, it is certainly possible that the neural conversion strategy proposed by us in Section 4 may be applied by readers of this work in other -potentially scaled-up -contexts. Since the conversion is used as a means to an end (producing an appropriate long-premise dataset) rather than as the central contribution of the current work, we do not provide an extended analysis of the pros and cons of this strategy.

A Quality of Converted Datasets
We evaluate the quality of the converted datasets by using benchmarks that probe the trained models for semantic phenomenon (Poliak et al., 2018;Richardson et al., 2019). Tables 10 and 11 shows the performance of the models on the different semantic phenomena. We show that the converted NLI datasets are at par, and sometimes better than a specially curated NLI dataset such as MNLI. For the purpose of illustration, we report results on the RACE converted dataset and MNLI.

B Comparison of Converted MCRC Datasets
We use the hybrid conversion strategy discussed in Section 4 to generate long-premise NLI datasets from each of the MCRC datasets -RACE, Mul-tiRC, DREAM and CosmosQA. As seen from Table 1 and 3, RACE has the longest average premise length as well as the most number of training examples. It is thus, intuitive to see from Tables 12, 13 and 14 that the RACE converted model outperforms the other converted models in each of the tasks.
C Reproducibility Checklist C.1 Details of the datasets used Table 15 gives the train/dev/test splits of the various source datasets used in this work. We follow the same splits after the conversion to NLI form. Since the test datasets are not openly available for Mul-tiRC and CosmosQA, we use the corresponding dev sets to report our results. Table 16 shows the proportion (absolute numbers) of neural, rule-based and Q+A examples in the final hybrid datasets.

C.2 Neural Conversion
We use the following training sequence to obtain the final neural conversion model:  (Demszky et al., 2018). ‡

C.2.1 Experiments
• The hyperparams for the models used throughout the Section 6 are shown in Table 18. These were obtained using minimal manual tuning.
• The threshold for CFCS as classification experiments (Section 6.2 (1)) we calculated by tuning for best balanced accuaracy https://scikit-learn. org/stable/modules/generated/ sklearn.metrics.balanced_ accuracy_score.html.          Table 18: Hyperparam setting for the models trained on MCRC datasets and MultiNLI (same for Q+A, Neural, and Hybrid from). These are common for all models in the experiments (Section 6).

D Conversion examples
Rule-based Neural Q: How do suburban commuters travel to and from the city in Copenhagen at present? A: About one third of the suburban commuters travel by bike.
Suburban commuters travel to about one third of the suburban commuters travel by bike and from the city in Copenhagen at present.
Suburban commuters travel to and from the city in Copenhagen at present by bike

Q:
What's the best title of the passage? A: Blame! Blame! Blame! The best title of the passage's blame.
The best title of the passage is Blame! Blame! blame! blamage! Q: What influence did the experiment have on Alexander ? A: He realized that slowing down his life speed could bring him more content.
The experiment had he realized that slowing down his life speed could bring him more content on Alexander.
The experiment influenced Alexander to realize that slowing down his life speed could bring him more content.
Q: Which of the following is TRUE about the report findings? A: The reading scores among older children have improved.
The reading scores among older children have improved is TRUE.
It is true that the reading scores among older children have improved. Q: How might Air New Zealand's video partner benefited from helping to make this video? A: Coincides with the 50th anniversary of Sports Illustrated's Swimsuit franchise Air New Zealand's video partner might benefited from helping to make this video by coincides with the 50th anniversary of Sports Illustrated's Swimsuit franchise.
Air New Zealand's video partner benefited from helping to make this video because it coincides with the 50th anniversary of Sports Illustrated's Swimsuit franchise.
Q: Did Alexander set out to secure his northern fronts and was he able to accomplish this goal? A: Yes and yes.

Unable to Convert
Alexander set out to secure his northern fronts and was he able to accomplish this goal.