Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning

Abductive and counterfactual reasoning, core abilities of everyday human cognition, require reasoning about what might have happened at time t, while conditioning on multiple contexts from the relative past and future. However, simultaneously incorporating past and future contexts with generative language models (LMs) can be challenging, as they are trained either to condition only on the past context or to perform narrowly scoped text-infilling. In this paper, we propose DeLorean, a new unsupervised decoding algorithm that can flexibly incorporate both the past and future contexts using only off-the-shelf, left-to-right language models and no supervision. The key intuition of our algorithm is to incorporate the future through back-propagation, during which we update only the internal representation of the output while fixing the model parameters. By alternating between forward and backward propagation, DeLorean can decode an output representation that reflects both the left and right contexts. We demonstrate that our approach is general and applicable to two nonmonotonic reasoning tasks, abductive text generation and counterfactual story revision, where DeLorean outperforms a range of unsupervised and some supervised methods, based on automatic and human evaluation.


Introduction
Everyday causal reasoning requires reasoning about the likely explanations of a partially observed past and future (abductive reasoning; Peirce, 1960) and reasoning about an alternative future based on a counterfactual past (counterfactual reasoning). Such nonmonotonic reasoning requires inferring plausible but potentially defeasible conclusions from incomplete or hypothetical observations (Reiter, 1988). While humans are remarkably good at this type of causal reasoning, developing AI systems capable of nonmonotonic reasoning for the wide range of situations describable in natural language remains a major open research question.

[Figure 1: DELOREAN, our proposed method, with generated reasoning results. Top: the goal in abductive reasoning is to generate a hypothesis (Y) of what happened between the observed past (X) and future (Z) contexts, e.g., X: "Ray hung a tire on a rope to make his daughter a swing."; Z: "Ray ran to his daughter to make sure she was okay."; generated Y: "She hit the rope and the tire fell on top of her." Bottom: in counterfactual reasoning, given a story context altered by a counterfactual condition, X, and the original ending Z, the goal is to generate a new ending Y which is coherent with X while remaining similar to Z. The story from TIMETRAVEL (Qin et al., 2019a) consists of five sentences. Our approach alternates forward (left-to-right) and backward (right-to-left) passes that iteratively refine the generated text with respect to the context from each side.]
More concretely, with abductive reasoning, the goal is to find the most plausible explanation for incomplete observations (Peirce, 1960). In the top part of Figure 1, given the first observation that Ray is "making his daughter a swing" and the later observation that he "ran to [her] to make sure she was okay," we can hypothesize that she somehow got hurt by the swing.
In contrast, counterfactual reasoning concerns the causal changes to future events given a change in the past condition (i.e., a "counterfactual condition"; Goodman, 1947). For example, the bottom part of Figure 1 shows an original five-sentence story (S1, ..., S5) and an alternative counterfactual condition given in S2: instead of a generic "Halloween party", the new counterfactual condition is that it is going to be a "Game of Thrones themed party"! Given these, the problem we want to solve is to update the future events (S3, ..., S5), so that instead of "Zeke dressed up as a skeleton", we have "Zeke dressed up like a Stark".

Recently, two tasks and corresponding benchmarks have been introduced to tackle language-based nonmonotonic reasoning: the ART dataset for abductive NLG (Bhagavatula et al., 2019), and the TIMETRAVEL dataset for counterfactual story rewriting (Qin et al., 2019a). Both tasks are framed as conditional generation, with multiple contexts to condition on. The currently dominant paradigm for conditional text generation is fine-tuning pre-trained language models (LMs), such as GPT-2 (Radford et al., 2019a), on large-scale training data for supervision. However, despite the large number of training examples, supervised approaches still perform considerably worse than humans and are prone to developing superficial strategies, such as repeating the observations verbatim or memorizing prevalent surface patterns specific to the dataset (Qin et al., 2019a). Furthermore, requiring large-scale training data for each new domain and task would be utterly inefficient for broad-coverage nonmonotonic reasoning in language.
In this paper, we investigate an alternative path toward language-based nonmonotonic reasoning that uses pre-trained language models as-is. Intuitively, both abductive and counterfactual reasoning require learning coherent patterns in narrative, which should already be available in large-scale pre-trained language models. However, the key challenge is that most generative language models are trained to condition only on the left context, or to perform narrowly scoped text-infilling. This paper presents DELOREAN: DEcoding for nonmonotonic LOgical REAsoNing, an unsupervised decoding algorithm that assumes only an off-the-shelf left-to-right language model and no supervision. The key intuition of our algorithm is to incorporate the future through back-propagation, during which we update only the internal representation of the output while fixing the model parameters. More specifically, DELOREAN alternates between forward and backward passes: the forward pass performs left-to-right inference given the left context (roughly maximizing P(Y|X) in Figure 1), while the backward pass instills the right constraint through right-to-left back-propagation with a task-specific loss (roughly maximizing P(Z|XY)). The forward and backward outputs are mixed into a single vector, from which tokens are sampled to generate the desired output. To choose the best output across iterations, we employ an unsupervised ranking step based on BERT's next-sentence prediction task to measure coherence (Devlin et al., 2018).
On both tasks, DELOREAN outperforms all other unsupervised methods in terms of both automatic metrics and human evaluation, demonstrating that nonmonotonic reasoning through conditional decoding is a promising research direction. Moreover, outputs produced by our model are judged as more coherent than those from the supervised models. In sum, our study shows that backpropagation-based decoding may enable additional future applications of unsupervised generation and reasoning.

Background
Most NLP benchmarks have focused on reasoning about information that is entailed from the premise. For instance, natural language inference (NLI; Bowman et al., 2015) focuses primarily on whether a hypothesis is entailed from a given premise, i.e., whether the information stated in the hypothesis is a subset of the information provided in the premise. However, it has been noted that human reasoning often goes the other way: hypotheses often contain new information that was not available in the premise but is plausibly true (though possibly defeasible given new additional context) (Johnson-Laird, 2006; Mercier and Sperber, 2017). This type of reasoning corresponds to nonmonotonic reasoning (Kraus et al., 1990), as it contradicts the monotonicity property according to which valid arguments cannot be made invalid by adding premises. We study two tasks of that nature: abductive reasoning (§2.1) and counterfactual reasoning (§2.2).

[Figure 2: Illustration of the DELOREAN decoding procedure, using abductive reasoning as an example. At initialization, the language model (LM) initializes the logits conditioned on the past context X ("Ray hung a tire on a rope to make his daughter a swing."). The backward pass then performs back-propagation based on the future constraint Z ("Ray ran to his daughter to make sure she was okay.") and produces the backward logits. In the subsequent forward pass, for each step n, the forward logits are computed and mixed with the backward logits.]

Abductive Reasoning
Abductive reasoning aims at finding the most likely explanation for partial observations (Peirce, 1960). It plays a central role in the human ability to "read between the lines," and is crucial for language acquisition (Andersen, 1973), understanding sentences in discourse (Hobbs et al., 1993), and many more. Despite its importance, however, relatively little focus has been given to it in NLP research.
Recently, Bhagavatula et al. (2019) proposed the abductive reasoning task: given two observations, the goal is to determine the most likely explanation of what happened in between. The dataset introduced for the task, ART, consists of 20k observations derived from the first and last sentences of stories in the ROCStories dataset (Mostafazadeh et al., 2016a). We focus on the abductive NLG setup introduced in the paper, which is framed as a conditional generation task where a plausible explanation for the observations must be generated in natural language. The authors reported the performance of several pre-trained LM-based baselines and showed the promise and limitations of such approaches.

Counterfactual Reasoning
Counterfactual reasoning aims at inferring alternative past events that could have happened given a certain change in conditions (Goodman, 1947; Starr, 2019). While counterfactual reasoning plays an important role in AI systems (Isard, 1974; Ginsberg, 1986), it requires causal reasoning abilities, which are arguably absent from current association-based AI (Pearl and Mackenzie, 2018). While there has been work on counterfactual reasoning in NLP, including recognizing counterfactuals in text (Son et al., 2017) and improving the performance of NLP tasks using counterfactual learning (Lawrence et al., 2017; Lawrence and Riezler, 2018), it remains a major research challenge.
Recently, Qin et al. (2019a) introduced the task of counterfactual story rewriting: given a five-sentence original story and an alternative context in which the second sentence of the story is altered by a counterfactual, the task is to generate a new three-sentence story ending that addresses the alternative beginning while minimally editing the original ending. The associated TIMETRAVEL dataset is based on fictional narratives from ROCStories, for which counterfactual contexts and alternative endings are crowdsourced, yielding 29,849 problem instances. Qin et al. (2019a) report several baseline performances and find that, in the unsupervised setup, models based on pre-trained LMs produce output that recognizes the counterfactual but deviates considerably from the original storyline. In contrast, in the supervised setup, models optimize the easier of the two goals and generate endings that are overly similar to the original endings.

The DELOREAN Approach
Humans make inferences based on available information and refine them when new information arrives. Since currently available pre-trained LMs generate text by sequentially predicting the next token from left to right, they are incapable of conditioning on future constraints. Therefore, we propose DELOREAN: an unsupervised backprop-based decoding algorithm, which is summarized in Algorithm 1, illustrated in Figure 2, and detailed below. DELOREAN intermittently refines the predictions to cohere with either the context or the constraints (Section 3.1). The candidate generations are then ranked by coherence (Section 3.2).

Decoding Strategy
Given context text X, the goal is to generate continuation text Y = (y_1, ..., y_N), such that Y satisfies certain constraints according to the reasoning task, usually defined based on another context Z (see Figure 1; we discuss the task-specific constraints in the respective task sections).
[Algorithm 1: DELOREAN decoding. Input: context X and constraint Z; initialize the logits Ỹ. For each iteration t = 1, ..., T: run the backward pass, updating each step's logits via gradient descent on the constraint loss (Eq. 1); then, for each step n of the forward pass, compute the forward logits (Eq. 2) and mix the forward and backward logits (Eq. 3); sample a candidate Y from the logits Ỹ and add it to Ys. Finally, rank Ys by coherence. Output: the most coherent generated text Y from Ys.]

The proposed approach interleaves two procedures, namely forward and backward, that produce and iteratively refine the generation for a predefined number of iterations T. In particular, the forward pass ensures the generated text is a fluent continuation of the context X, while the backward pass informs the model about the constraint and steers the generation to satisfy it.
As detailed below, the backward pass uses gradient descent to update the generation Y. However, Y is discrete text and thus not differentiable. Instead, throughout the algorithm, we maintain a soft representation of the sequence Ỹ = (ỹ_1, ..., ỹ_N), where ỹ_n ∈ R^V represents the logits of the n-th token and V is the vocabulary size. After the logits are refined over multiple iterations of the forward and backward passes, we generate discrete text at each step by sampling from y_n ~ softmax(ỹ_n / τ), where τ > 0 is the temperature.
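As a minimal, self-contained sketch of this sampling step (plain NumPy, a toy 4-token vocabulary, and an illustrative seed, all our own assumptions rather than the paper's implementation):

```python
import numpy as np

def sample_from_logits(logits, tau=1.0, seed=0):
    """Sample one token id from a logit vector at temperature tau."""
    rng = np.random.default_rng(seed)
    z = logits / tau
    z = z - z.max()                      # numerical stabilization
    p = np.exp(z) / np.exp(z).sum()      # softmax(logits / tau)
    return int(rng.choice(len(p), p=p))

logits = np.array([1.0, 4.0, 2.0, 0.5])
token = sample_from_logits(logits, tau=0.7)   # most likely the argmax (id 1)
```

A low τ sharpens the distribution toward the argmax, while a high τ flattens it toward uniform sampling.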
Backward. The backward pass uses gradient back-propagation to update the generation with respect to the constraint. Specifically, we express the task-specific constraint as a loss function L(X, Ỹ^(t−1), Z) that evaluates how well the generation Y (approximated with the soft representation Ỹ) obeys the constraint (see the subsequent sections for concrete instantiations of the loss). The goal of this pass is thus to minimize the loss w.r.t. the generation. Specifically, at iteration t, for each step n in the generation, we update its logits with:

  ỹ_n^(t),b = ỹ_n^(t−1) − λ · ∇_{ỹ_n} L(X, Ỹ^(t−1), Z),   (1)

where ∇_{ỹ_n} L(X, Ỹ^(t−1), Z) is the gradient of the constraint-informed loss L w.r.t. the n-th logits, and λ ∈ R is the step size. In practice, we may repeat the gradient update multiple times in a single pass.
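To make the update concrete, here is a toy sketch (our own construction, not the paper's implementation): we treat a single token's logits as the variable, use cross-entropy toward a fixed target token as a stand-in for the constraint loss L, and apply one gradient step with step size λ. For this loss, the gradient w.r.t. the logits is the classic softmax(ỹ) − onehot(target).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def backward_step(y_tilde, target, lam=1.0):
    """One backward-pass update on one token's logits:
    y_b = y_tilde - lam * dL/dy_tilde, where
    L = -log softmax(y_tilde)[target] is a toy constraint loss."""
    grad = softmax(y_tilde)
    grad[target] -= 1.0                  # gradient of the cross-entropy
    return y_tilde - lam * grad

y = np.zeros(5)                          # uniform logits over a toy vocab
y_b = backward_step(y, target=2, lam=1.0)
# the updated logits now put more mass on the constraint's target token
```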
Forward. The forward pass ensures that Y is fluent and coherent with the preceding context X. At iteration t, for a particular step n, we compute the forward logits with the LM:

  ỹ_n^(t),f = LM(X, Ỹ^(t)_{1:n−1}).   (2)
We then mix the n-th-step forward and backward logits to get the final logits of iteration t:

  ỹ_n^(t) = γ · ỹ_n^(t),f + (1 − γ) · ỹ_n^(t),b,   (3)

where 0 < γ < 1 is the mixing weight. The resulting logits ỹ_n^(t) are then fed to the LM to compute the forward logits at the (n+1)-th step (Eq. 2). This way, information from the backward pass is integrated into the left-to-right generation process to produce text that is informed by the constraint.
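The mixing step itself is a simple convex combination of the two logit vectors; a minimal sketch (NumPy, toy 3-token "vocabulary" of our own choosing):

```python
import numpy as np

def mix_logits(forward_logits, backward_logits, gamma=0.5):
    """Convex combination of forward and backward logits,
    with mixing weight 0 < gamma < 1."""
    return gamma * forward_logits + (1.0 - gamma) * backward_logits

f = np.array([2.0, 0.0, 0.0])          # forward pass prefers token 0
b = np.array([0.0, 2.0, 0.0])          # backward pass prefers token 1
mixed = mix_logits(f, b, gamma=0.75)   # leans toward the forward pass
```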
We pre-define the number of tokens N required by the backward pass, but we allow the forward pass to generate more than N tokens if those are needed to obtain complete sentences. In that case, we set the logits of the extra tokens to the forward logits, without mixing: ỹ_n^(t) = ỹ_n^(t),f for n > N. We then prune any trailing tokens in the sampled text to get complete sentences.

Ranking
The output of the decoding step is a list of candidate generations, one per iteration: Ys = {Y^(t) | t = 1, ..., T}. We further use an unsupervised approach to rank the candidates and pick the best sample as the final output. Specifically, we take advantage of the BERT model, which was pre-trained with a next-sentence prediction (NSP) objective. Given two sentences A and B, we use NSP to compute the likelihood of B following A as a proxy for coherence:

  c(A, B) = BERT_NSP(A, B),   (4)

where c(·, ·) denotes the coherence score. This score is used to evaluate the quality of a given candidate continuation Y by measuring (1) its compatibility as a continuation of the context X, (2) the internal consistency of Y if it consists of multiple sentences, and (3) the compatibility of Y with the text on its right side, when applicable.

[Table 1: Automatic evaluation results on the abductive task, using the test set of ART.]
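The ranking logic can be sketched as follows (our own toy illustration: a word-overlap function stands in for BERT's next-sentence score so the example stays self-contained; the actual method scores coherence with BERT NSP):

```python
def rank_candidates(candidates, x, z, coherence):
    """Pick the candidate Y maximizing c(X+Y, Z) + c(X, Y+Z),
    where `coherence` stands in for BERT's next-sentence score."""
    def score(y):
        return coherence(x + " " + y, z) + coherence(x, y + " " + z)
    return max(candidates, key=score)

# toy stand-in for c(.,.): word overlap between the two segments
def overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

best = rank_candidates(
    ["The tire fell on her.", "He bought a boat."],
    x="Ray hung a tire on a rope to make his daughter a swing.",
    z="Ray ran to his daughter to make sure she was okay.",
    coherence=overlap,
)
```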

Task 1: Abductive Reasoning
Each instance in the ART dataset consists of two observations O1, O2 and a hypothesis H that explains them. These inputs naturally map to X, Z, and Y in our framework. Formally, the abductive generation task aims to maximize P(Y|X, Z), i.e., models must consider both the left and right contexts (X and Z) jointly.

Task Setup
Constraints. We maximize the likelihood of Z given XỸ by defining the loss function as the cross-entropy loss of generating Z given XỸ with the LM:

  L(X, Ỹ, Z) := − Σ_{n=1}^{N_Z} log P_LM(z_n | X, Ỹ, Z_{1:n−1}),   (5)

where P_LM(a_j | a_{1:j−1}) is the likelihood of generating token a_j given the preceding text a_{1:j−1}, and N_Z is the length of Z. Following the earlier study of the task (Bhagavatula et al., 2019), we also prepend Z to X to "leak" the future information to the LM. That is, we replace X with Z e X in the above equation, where e denotes a special end-of-text token. However, the comparisons with the respective baselines below show that the prepended Z has only a minor effect on performance.
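The loss above reduces to summing negative log-probabilities of Z's tokens under the LM. A toy sketch with hypothetical per-token probabilities (no real LM involved; the probability values are invented for illustration):

```python
import numpy as np

def abductive_loss(z_token_probs):
    """Negative log-likelihood of the future observation Z, where
    z_token_probs[n] stands in for P_LM(z_n | X, Y~, Z_{1:n-1})."""
    return float(-np.sum(np.log(z_token_probs)))

# hypothetical LM probabilities for a 3-token Z under two candidate Y~'s
loss_good = abductive_loss([0.9, 0.8, 0.7])   # this Y~ makes Z likely
loss_bad = abductive_loss([0.1, 0.2, 0.1])    # this Y~ makes Z unlikely
```

Minimizing this loss in the backward pass pushes the logits of Ỹ toward values that make the observed future Z probable.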
Ranking. We rank candidates by the overall coherence after inserting Y between X and Z:

  ranking_score(Y) = c(XY, Z) + c(X, YZ).   (6)

Hyperparameters. We use GPT2-345M (Radford et al., 2019b) as the pre-trained LM for all models, and use the ART development set to select hyperparameters. We use greedy decoding for our method and top-k sampling (Fan et al., 2018) (k = 40, τ = 0.7) for our baselines. Other hyperparameters are outlined in Appendix A.1.

[Figure 3: Examples of generated hypotheses on abductive reasoning cases. Given observations O1 and O2, DELOREAN generates a hypothesis explaining the observations. E.g., O1: "Ray drove his car on a steep mountain road." O2: "Ray was fine but his car was totaled." Generated hypothesis: "As he drives the car to the top of the mountain his car is hit by a car." Another case: O1: "Peter was excited to go to the Sanders rally in New Hampshire." O2: "He couldn't wait to vote for him." Generated hypothesis: "He has a long history of supporting Bernie Sanders and was excited to see him in person."]

Experimental Setup
Baselines. We compare our method against the baselines of Bhagavatula et al. (2019). The unsupervised baselines use a pre-trained GPT-2 model to generate Y given a prompt text, either the observation X alone (Zero-Shot_X) or Z e X (Zero-Shot_ZX). The supervised method (Sup) follows the same input format as Zero-Shot_ZX, but fine-tunes GPT-2 on the ART training set. Finally, our knowledge-informed baseline (+COMET-Emb) further augments the representation of Sup with knowledge from COMET (Bosselut et al., 2019).
To separately study the contributions of our decoding strategy and ranking component, we also report the performance of ranking the baseline outputs. Specifically, we let each baseline generate 20 candidates and rank them by coherence (Eq. 6).

Results
Automatic Evaluation. We report the same metrics as Bhagavatula et al. (2019): BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and BERTSCORE (Zhang et al., 2019) (with the bert-base-uncased model). The results in Table 1 show that DELOREAN performs best among the unsupervised systems across all metrics. We also note that our ranking step improves the performance of both our model and the zero-shot baselines.
Human Evaluation. We conduct two sets of human evaluations on 100 test examples using crowdworkers from Amazon Mechanical Turk. In the scoring setting, presented in Table 2, workers were presented with a pair of observations (X and Z) and a generated hypothesis Y, and asked to rate the coherence of the hypothesis with respect to the observation X (X-Y), the observation Z (Y-Z), and both (X-Y-Z), on a 4-point Likert scale. In the pairwise comparison setting, presented in Table 3, workers were presented with the outputs from a pair of systems (DELOREAN and a baseline) and asked to choose the better output in terms of the same coherence criteria. Each example was labeled by 3 workers.

In both evaluation setups, our method substantially outperforms the unsupervised baselines, achieving a relative improvement of 36%-215% with respect to Y-Z coherence. Our method also outperforms the supervised methods with respect to X-Y coherence (Table 2), and achieves competitive performance in the pairwise comparison (Table 3). Again, the ranking component contributes to increasing performance for the zero-shot baselines. Finally, the large performance gap between all methods and human-written explanations stresses the difficulty of this reasoning task and warrants future research.

Figure 3 presents two example outputs produced by DELOREAN. Our approach generates reasonable hypotheses by taking into account both the past and future contexts. For instance, in the first example, the future observation (O2) "car was totaled" indicates that Ray had a car accident, which is correctly captured in the generated hypothesis "car is hit by a car".

[Footnote 4: We tried ablating the ranking component from our method in preliminary experiments, and found that ranking is essential to obtaining good performance. By adding ranking to our baselines, we assess the contribution of our decoding strategy.]

Task 2: Counterfactual Reasoning
Given an original story ending Z of a story context X_ori, and a counterfactual condition X that changes X_ori so as to invalidate Z (see Fig. 1), the task is to generate a new story ending Y that minimally edits the original ending Z to regain coherence with the counterfactual condition X (Qin et al., 2019a).

Task Setup
Constraints. The constraint we enforce is that Y stays close to Z (i.e., minimal edits). We impose this constraint by minimizing the KL divergence between the two:

  L(X, Ỹ, Z) := KL(Z ‖ softmax(Ỹ / τ)),   (7)

where, with a slight abuse of notation, Z is the one-hot distribution of the tokens in the original ending. That is, we encourage the generated logits to recover the original ending.

[Figure 4: Human calibration results for counterfactual generation in terms of the weighted harmonic mean of coherence and min-edit, H_β = (1 + β²) · coherence · min-edit / (β² · coherence + min-edit), as a function of the scaling factor β. Low β values assign more weight to coherence; high β values place more emphasis on min-edit.]
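A minimal sketch of this minimal-edit constraint (NumPy, toy 3-token vocabulary of our own choosing): with a one-hot Z, the per-token KL term reduces to the negative log-probability that softmax(ỹ) assigns to the original token.

```python
import numpy as np

def softmax(z, tau=1.0):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def minimal_edit_loss(y_logits, z_ids, tau=1.0):
    """KL(Z || softmax(Y~/tau)) summed over steps; since Z is one-hot,
    each term is -log of the probability of the original-ending token."""
    probs = softmax(y_logits, tau)
    steps = np.arange(len(z_ids))
    return float(-np.log(probs[steps, z_ids]).sum())

z_ids = np.array([1, 0])               # token ids of the original ending
close = np.array([[0.0, 3.0, 0.0],     # logits already near Z
                  [3.0, 0.0, 0.0]])
far = np.zeros((2, 3))                 # uniform logits, far from Z
```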
Ranking. We rank the candidates based on both their coherence with the context and the internal coherence between the multiple sentences of each candidate (the rewritten ending consists of 3 sentences). More concretely, given a candidate Y with sentences (Y^(1), Y^(2), Y^(3)), we compute the aggregated coherence score:

  ranking_score(Y) = c(X, Y) + Σ_{i=1}^{2} c(Y^(i), Y^(i+1)).   (8)

Hyperparameters. We largely follow the same setting as in the abductive reasoning task, but tune hyperparameters on the TIMETRAVEL development set. Deviations from these settings are outlined in Appendix A.2.
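The aggregated ranking can be sketched as follows (word overlap again stands in for the BERT coherence score, and the toy sentences are invented; the decomposition into context-to-ending coherence plus consecutive-sentence coherence follows the description above):

```python
def counterfactual_rank_score(context, ending_sentences, coherence):
    """Context-to-ending coherence plus coherence between each pair
    of consecutive sentences in the rewritten ending."""
    score = coherence(context, " ".join(ending_sentences))
    for a, b in zip(ending_sentences, ending_sentences[1:]):
        score += coherence(a, b)
    return score

# toy stand-in for BERT NSP: word overlap between two segments
def overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

score = counterfactual_rank_score(
    "tara ordered a shirt online",
    ["they sent her a shirt", "the shirt fit well", "she wore the shirt"],
    coherence=overlap,
)
```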

Experimental Setup
Baselines. We compare our method with the baselines of Qin et al. (2019a). The zero-shot baseline uses the pre-trained GPT-2 model to generate Y as a continuation of the counterfactual condition X; it is the most apt comparison to our method, which also requires no additional supervision. We also experiment with two baselines that fine-tune GPT-2 on the original story X_ori Z to fit the model to the story domain, either with an LM objective (FT) or a tailored conditional objective that encourages minimal edits of Z (Recon+CF). Finally, we report the performance of a supervised baseline (Sup), in which GPT-2 is fine-tuned to produce the gold Y from X_ori Z and X.

[Figure 5: Examples of generated story endings on counterfactual reasoning cases. Given a story context, a counterfactual condition, and an original ending, DELOREAN generates a rewritten ending which is coherent with the counterfactual condition and similar to the original ending. E.g., story context: "Tara wanted to buy a new shirt for her upcoming school formal." Counterfactual condition: "She knew of a cool place online that did custom fits really cheaply, and ordered from there." (original: "She went to the mall with her mom."). Original ending: "They browsed shirts from a variety of stores. Tara picked out a floral patterned shirt that she liked best. Tara looked forward to wearing it." Rewritten ending: "They sent her a shirt that fit her perfectly. Tara was so excited to wear it. She looked forward to wearing it." Another case: story context: "Shane and John were best friends at school." Counterfactual condition: "Shane was caught stealing and got suspended from school." (original: "Shane enjoyed volunteering his time helping others."). Original ending: "John was a good student and was always looking for ways to help others. They were both very kind and caring people. Shane was a member of the Boy Scouts of America." Rewritten ending: "John was not allowed to be friends with Shane anymore. This bothered John greatly but his mom explained the reasons. She explained that Shane was a bad influence on John."]

Results
Automatic Evaluation. Following Qin et al. (2019a), we report BERTSCORE (Zhang et al., 2019), which was shown to best correlate with human judgments of counterfactual coherence, and BLEU-4 and ROUGE-L, which better measure minimal edits. We find that the discriminative baselines achieve the highest degree of plot fidelity, while DELOREAN achieves the highest BERTSCORE for counterfactual coherence.
Human Evaluation. We repeat the human evaluation setup from Section 4.3. Presented with the original story, the counterfactual condition X, and the generated ending Y, workers were asked to judge (1) the coherence of Y with respect to X, and (2) the extent to which the generated ending minimally edits the original ending. To judge both criteria jointly, we report the weighted harmonic mean H_β of these scores across a range of weights β (Figure 4). Our results show that DELOREAN is the only model that maintains a consistent balance between coherence (1.66) and minimal edits (1.54). While the ranking-augmented zero-shot model produces the most coherent endings (coherence = 1.8), it deviates from the original ending. As β increases (i.e., as minimal edits gain importance), its weighted performance drops considerably, indicating that it cannot generate new endings that follow the original plot of the story (min-edit = 1.25). Conversely, Recon+CF generates stories that are faithful to the original endings but far less coherent with the counterfactual condition (coherence = 1.23). Through human annotation, we found that Recon+CF copies the original ending word-for-word in 84% of cases.
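The H_β trade-off is easy to reproduce for the scores quoted above (coherence = 1.66, min-edit = 1.54 for DELOREAN); a small sketch:

```python
def h_beta(coherence, min_edit, beta):
    """Weighted harmonic mean of coherence and min-edit:
    beta -> 0 recovers coherence; large beta emphasizes min-edit."""
    return ((1 + beta**2) * coherence * min_edit
            / (beta**2 * coherence + min_edit))

balanced = h_beta(1.66, 1.54, beta=1.0)   # ~1.60, between the two scores
```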
The pairwise comparison results in Table 5 parallel these observations. DELOREAN significantly outperforms the discriminative approaches (Recon+CF and Sup+Disc) in coherence, while falling short of the Zero-shot re-ranked baselines.
In minimal edits, this pattern is flipped, with our approach outperforming the zero-shot baselines considerably and losing to the discriminative baselines.
Qualitative Analysis. Figure 5 provides two example results for counterfactual story rewriting by DELOREAN. The approach successfully captures the causal relations between events and properly rewrites the endings with minimal edits. For instance, in the first example, given the counterfactual condition that Tara ordered a shirt online (as opposed to the original, in which she went to the mall), the rewritten ending is about the shirt being sent to Tara (as opposed to her browsing shirts in stores). The last sentence of the original ending, "She looked forward to wearing it", is correctly preserved, as it is coherent with the counterfactual condition.

Related Work
Unsupervised text generation. Unsupervised approaches are often applied to problems that copy information from a source text into decoded text.
Unsupervised paraphrasing requires repeating this information (Miao et al., 2019; Bao et al., 2019), as does translation, but with a bilingual transformation (Artetxe et al., 2017; Lample et al., 2018). Summarization additionally requires selecting a subset of the original text (Baziotis et al., 2019; Schumann et al., 2020; West et al., 2019). In cases where information is mostly copied from the original, auto-encoding objectives can ensure the correct information is captured (Bao et al., 2019; Baziotis et al., 2019; Artetxe et al., 2017). This work tackles problems where generation is more open-ended: rather than reproducing information from the prompt, generations should agree with and expand on it, making auto-encoding less applicable.
Controllable language generation. Earlier approaches to controllable generation involved preserving the content of text while changing it along discrete dimensions, such as theme, sentiment, or style (Koncel-Kedziorski et al., 2016; Hu et al., 2017; Ficler and Goldberg, 2017; Shen et al., 2017; Lample et al., 2019). Recent works such as Grover (Zellers et al., 2019) and CTRL (Keskar et al., 2019) used these ideas to build transformer language models that can condition on structured metadata such as source, domain, etc. The Plug & Play model (PPLM; Dathathri et al., 2019) controls topic and sentiment with an approach similar to ours, involving forward and backward passes to update token distributions. However, PPLM relies on trained attribute discriminators for supervision, while our method is unsupervised. While these models are restricted to specific dimensions, often with pre-defined values, our model can adjust to any open-ended textual constraint. Perhaps the most similar work in this respect is on "text infilling" models, which, however, operate in a narrower setting, filling only a relatively short text span (Devlin et al., 2018; Zhu et al., 2019; Donahue et al., 2020), and are more restrictive due to their reliance on an extra right-to-left language model (Sun et al., 2017) or a pre-specified generation length (Zeldes et al., 2020), the latter of which is not publicly available.
Reasoning about narratives. A prominent resource from recent years is the ROCStories corpus (Mostafazadeh et al., 2016b), consisting of 98K crowdsourced 5-sentence everyday life stories. It was used for the story cloze task, whose goal is to predict the story ending from its first 4 sentences, but gained popularity and became the basis of additional benchmarks (Rashkin et al., 2018). Additional related work includes "script knowledge", i.e., learning about prototypical series of events (Schank and Abelson, 1977; Chambers and Jurafsky, 2008; Pichotta and Mooney, 2014), temporal commonsense (Granroth-Wilding and Clark, 2016; Li et al., 2018), and modeling pre- and post-conditions of events (Roemmele et al., 2011). Qin et al. (2019b) studied conversation modeling that reads and connects the dots of events in related documents. Finally, a recent line of work explores counterfactual questions in reading comprehension (Tandon et al., 2019), but instantiates the problem of counterfactual reasoning as a multiple-choice task.

Conclusion
We presented DELOREAN, an unsupervised LM-based approach to generating text conditioned on past context as well as future constraints, through forward and backward passes considering each condition. We demonstrated its effectiveness for abductive and counterfactual reasoning, on which it performed substantially better than unsupervised baselines. Our method is general and can easily be adapted to other generative reasoning tasks.