Learning to Explain: Answering Why-Questions via Rephrasing

Providing plausible responses to why-questions is a challenging but critical goal for language-based human-machine interaction. Explanations are challenging in that they require many different forms of abstract knowledge and reasoning. Previous work has relied either on human-curated structured knowledge bases or on detailed domain representations to generate satisfactory explanations. Such approaches are also often limited to ranking pre-existing explanation choices. In our work, we contribute to the under-explored area of generating natural language explanations for general phenomena. We automatically collect large datasets of explanation-phenomenon pairs, which allow us to train sequence-to-sequence models to generate natural language explanations. We compare different training strategies and evaluate their performance using both automatic scores and human ratings. We demonstrate that our strategy is sufficient to generate highly plausible explanations for general open-domain phenomena compared to other models trained on different datasets.


Introduction
Allowing machines to provide human-acceptable explanations has long been a difficult task for natural language interaction (Carenini and Moore, 1993). In order to provide explanations, systems need to acquire sophisticated domain knowledge (Winograd, 1971), conduct causal reasoning over complex sets of events (Hesslow, 1988) and over narrative chains (Chambers and Jurafsky, 2008), and apply commonsense knowledge (Levesque et al., 2011).
Past work has demonstrated that by leveraging human-curated structured knowledge bases such as WordNet (Miller, 1995) or ConceptNet (Liu and Singh, 2004), a system can learn to rank or choose between multiple plausible explanations and reach high accuracy (Luo et al., 2016; Sasaki et al., 2017). Recent successes have also shown that structured knowledge is not needed if one can train a language model on a large quantity of text. Such a model can rank explanations based on the probability that each explanation might appear in natural text (Trinh and Le, 2018).

Figure 1 (content):
Phenomenon: The city councilmen refused the demonstrators a permit because ?
Original: The city councilmen feared violence.
L2E-Seq2Seq (greedy): They were not allowed to march in the city.
L2E-Seq2Seq (beam): They did not have a permit.
LM-1B: They were not allowed to use the Cape Town airport.
L2W: It was the only thing in the city that could be done.
Open-Subtitle: I don't know.
While ranking explanations is an important task, the nature of explanation is more general than this. For one phenomenon, there might be many acceptable, natural, and useful explanations. In our work, instead of simply ranking or choosing explanations generated by humans, we propose to advance this important domain by directly generating the explanation. We measure success based on whether the generated sequence is grammatically correct and is a fluent, natural, and plausible explanation. This task has two advantages. First, it allows us to explore whether such a task is computationally feasible given the current learning framework. Second, answering open-domain why-questions with plausible answers can make chitchat dialogue systems more engaging, especially in response to "why" questions (which previous systems typically answer with degenerate responses such as "I don't know").
We show that simply training a language model on previously existing datasets is not enough. However, by leveraging dependency parsing patterns, we are able to construct two new datasets that allow modern neural networks to learn to generate general-domain explanations that are plausible to humans. These new datasets of naturally occurring self-explanations (statements with "because", unprompted by a question) provide an excellent training signal for generating novel explanations for a given phenomenon. We conduct human experiments on the important features that contribute to plausible explanations, and we describe a simple procedure that rephrases Why-questions into statements so that our model can also function as a single-round chitchat chatbot that answers Why-questions.

Learning to Explain
We use the discourse extractor developed by Nie et al. (2017). This extractor first filters sentences that contain a particular discourse marker (in our case, the marker "because"). It then uses predefined, pattern-based rules on the dependency parse obtained from the Stanford CoreNLP dependency parser (Manning et al., 2014) to split the sentence into two semantically complete clauses, which we refer to as S1 and S2. Dependency parsing allows us to isolate explanations and phenomena from exogenous modifying phrases. Using these patterns to parse sentences with "because" also allows us to deal with the free order of the explanation and phenomenon in English. We formulate the L2E task as follows: given the phenomenon S1, the model needs to learn to generate a plausible explanation S2.
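As a rough illustration of the S1/S2 split and of the two clause orders, the sketch below uses plain string matching. It is not the extractor of Nie et al. (2017), which applies pattern rules to a dependency parse; the function name `split_because` and the comma heuristic for fronted clauses are assumptions made only for this example.

```python
import re
from typing import Optional, Tuple


def split_because(sentence: str) -> Optional[Tuple[str, str]]:
    """Return (S1 phenomenon, S2 explanation), or None if no split is found.

    String-matching stand-in for illustration only; the extractor used in
    the paper relies on dependency-parse patterns, not on this heuristic.
    """
    text = sentence.strip().rstrip(".")
    # Order 1: "Because S2, S1" (the explanation comes first).
    fronted = re.match(r"^because\s+(.+?),\s+(.+)$", text, flags=re.IGNORECASE)
    if fronted:
        s2, s1 = fronted.group(1), fronted.group(2)
        return s1.strip(), s2.strip()
    # Order 2: "S1 because S2" (the phenomenon comes first).
    parts = re.split(r"\s+because\s+", text, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 2 and all(parts):
        return parts[0].strip(), parts[1].strip()
    return None


pair_a = split_because(
    "The trophy doesn't fit in the suitcase because the trophy is too big."
)
pair_b = split_because(
    "Because the trophy is too big, it doesn't fit in the suitcase."
)
```

A surface heuristic like this would fail on modifiers and nested clauses, which is the motivation the text gives for working on the dependency parse instead.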
In addition to retrieving the phenomenon-explanation pair, we retrieve the five sentences that immediately precede the phenomenon to provide context. We concatenate the context with S1 using a special separation token, resulting in the sequence C1, C2, ..., C5 <SEP> S1. We hypothesize that context will allow the model to generate more thematically relevant explanations. We refer to this setting as the L2EC task.
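The L2EC source sequence can be assembled as in the following minimal sketch. The helper name `build_l2ec_source` and the choice to join pieces with single spaces are assumptions for illustration; only the `<SEP>` token and the five-sentence window come from the description above.

```python
from typing import List

SEP = "<SEP>"  # separator token named in the text


def build_l2ec_source(context: List[str], s1: str, n_context: int = 5) -> str:
    """Concatenate the last `n_context` context sentences, SEP, and S1."""
    kept = context[-n_context:]  # sentences immediately preceding S1
    # Joining with single spaces is an assumption of this sketch.
    return " ".join(kept + [SEP, s1])


context_sentences = ["C0.", "C1.", "C2.", "C3.", "C4.", "C5."]
source = build_l2ec_source(context_sentences, "the phenomenon")
```

Here the oldest sentence (`C0.`) is dropped because only the five sentences closest to S1 are kept.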
Finally, we describe a procedure in Algorithm 1 that uses dependency parsing to turn Why-questions into the statement format of S1. This allows us to generate explanations as responses to Why-questions.

Algorithm 1 Q-to-S1
Input: question q, dependency parsed.
  Remove "Why".
  Start at the ROOT of q:
    subj = NSUBJ or NSUBJPASS
    aux = first dependent in [AUX, COP, AUXPASS]
    vp (lemma) = all remaining dependents
  if aux in ["do", "does", "did"] then
    vp = apply tense/person of aux to vp (lemma)
  else
    vp = aux vp (lemma)
  end if
  s = subj vp
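Algorithm 1 can be sketched in Python on a hand-built toy parse. Everything below is a simplified assumption for illustration: the `Token` structure (with whole phrases stored as single tokens), the lowercase dependency labels, and the tiny `inflect` helper with its one-entry irregular-verb table are hypothetical stand-ins for a real dependency parser and morphological inflector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str   # surface form (a whole phrase, for simplicity)
    lemma: str  # lemma of the head word
    dep: str    # label relative to ROOT: "nsubj", "aux", "cop", "root", ...


IRREGULAR_PAST = {"go": "went"}  # toy table, hypothetical


def inflect(lemma: str, aux: str) -> str:
    """Apply the tense/person carried by do/does/did to a verb lemma."""
    if aux == "did":
        if lemma in IRREGULAR_PAST:
            return IRREGULAR_PAST[lemma]
        return lemma + ("d" if lemma.endswith("e") else "ed")
    if aux == "does":
        return lemma + "s"
    return lemma  # "do": the base form is kept


def q_to_s1(tokens: List[Token]) -> str:
    # Remove "Why".
    tokens = [t for t in tokens if t.text.lower() != "why"]
    subj = next(t for t in tokens if t.dep in ("nsubj", "nsubjpass"))
    aux = next((t for t in tokens if t.dep in ("aux", "cop", "auxpass")), None)
    rest = [t for t in tokens if t is not subj and t is not aux]
    if aux is not None and aux.text.lower() in ("do", "does", "did"):
        # Move the tense/person of the auxiliary onto the root verb.
        words = [
            inflect(t.lemma, aux.text.lower()) if t.dep == "root" else t.text
            for t in rest
        ]
    elif aux is not None:
        # Keep the auxiliary or copula in front of the remaining phrase.
        words = [aux.text] + [t.text for t in rest]
    else:
        words = [t.text for t in rest]
    return " ".join([subj.text] + words)


q1 = [
    Token("Why", "why", "advmod"),
    Token("did", "do", "aux"),
    Token("the city councilmen", "councilman", "nsubj"),
    Token("refuse", "refuse", "root"),
    Token("the demonstrators a permit", "permit", "obj"),
]
q2 = [
    Token("Why", "why", "advmod"),
    Token("is", "be", "cop"),
    Token("the sky", "sky", "nsubj"),
    Token("blue", "blue", "root"),
]
s1_a = q_to_s1(q1)
s1_b = q_to_s1(q2)
```

The first example yields the statement form of the Figure 1 phenomenon, which a model can then continue with "because".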

Language Modeling
Language modeling focuses on modeling the joint probability of a sequence $p(X = x_1, \ldots, x_n)$. Using the chain rule, this can be decomposed as $p(X) = \prod_{t=1}^{n} p(x_t \mid x_{<t})$, the product of conditional probabilities. The model, parameterized by $\theta$, is optimized to maximize the log of the likelihood function $L(X; \theta) = \prod_{t=1}^{n} p_\theta(x_t \mid x_{<t})$. In a neural language model, proposed by Bengio et al. (2003), a recurrent neural network is trained by truncated backpropagation through time to learn to model (theoretically) an infinitely long sequence.
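For reference, taking the log of the likelihood turns the product of conditionals into a sum, which is the quantity typically accumulated per token during training. This is the standard identity, written with the same symbols as above:

```latex
\log L(X; \theta)
  = \log \prod_{t=1}^{n} p_\theta(x_t \mid x_{<t})
  = \sum_{t=1}^{n} \log p_\theta(x_t \mid x_{<t})
```

As a small worked case (with made-up numbers), for $n = 2$ and conditionals $0.5$ and $0.2$, $p(X) = 0.1$ and $\log p(X) = \log 0.5 + \log 0.2 \approx -2.303$ using the natural logarithm.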

Sequence to Sequence Modeling
First introduced by Sutskever et al. (2014), sequence-to-sequence (Seq2Seq) modeling estimates the conditional probability distribution of a sequence Y given a sequence X, $p(Y \mid X)$, where $X = \{x_1, \ldots, x_n\}$ and $Y = \{y_1, \ldots, y_k\}$. The overall objective is similar to that of a language model: to maximize the log of the likelihood of the Y sequence given the X sequence, $L(Y, X; \theta, \psi) = \prod_{t=1}^{k} p_{\theta,\psi}(y_t \mid y_{<t}, X)$, with parameters $\theta$ for the encoder and $\psi$ for the decoder. In our work, we experiment with different architectures for the encoder and decoder.

Data
We provide data accessibility statements in Appendix A.1 for each dataset we use to train and evaluate our models. Our constructed dataset and web demo code are publicly available.

Training Data
NewsCrawl Dataset We build our training dataset from two large news datasets: Gigaword Fifth Edition (Parker et al., 2011) and NewsCrawl (Bojar et al., 2018). These two datasets contain news stories from 2001-2017 and are non-overlapping. We built our dataset of News explanation pairs using the pipeline described in Section 2 and then split it into training, validation, and test sets. More details are reported in Appendix A.2.
BookCorpus BookCorpus is a set of unpublished novels (Romance, Fantasy, Science fiction, and Teen genres) collected by Zhu et al. (2015).
We use a publicly available pre-trained BookCorpus language model from Holtzman et al. (2018).
We refer to this model as L2W.
Language Modeling One Billion This dataset (LM-1B) is currently the largest standard training dataset for language modeling, roughly the same size as BookCorpus. This dataset is a subset of the NewsCrawl dataset, from 2007-2011. We use a pre-trained language model on this corpus from Jozefowicz et al. (2016). We refer to this model as LM-1B.

Evaluation Data
News Commentary (NC) Dataset We collect pairs from a public dataset that contains predominantly commentary written about current news. We use this dataset as the main evaluation of the news-based explanation because (1) it is a separate dataset without any overlap with NewsCrawl, and (2) it still belongs to the same news domain, so it provides an in-domain evaluation for the L2E, L2EC, and LM-1B models.

Winograd Schema Challenge Subset (WSC-G)
We use 61 example sentences in the Winograd Schema Challenge that contain the words "because" or "so". Similar to Trinh and Le (2018), we substitute the ambiguous pronouns with the correct referent. For example, we ask the model to generate the correct explanation "the trophy is too big" for the phenomenon "The trophy doesn't fit in the suitcase".
Choice of Plausible Alternatives (COPA) Roemmele et al. (2011) proposed a task that contains questions such as "The women met for coffee. What was the CAUSE of this?", and the model is asked to choose between two pre-defined causes. In our setting, we directly ask the model to generate a cause. For language models, we append "because" to the end of each COPA sentence and ask the model to generate the rest.
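The COPA prompt construction for language models amounts to a one-line transformation, sketched below. The function name `copa_to_prompt` is hypothetical, and dropping the sentence-final period before appending "because" is an assumption of this sketch; the text only says that "because" is appended.

```python
def copa_to_prompt(premise: str) -> str:
    """Turn a COPA premise into a language-model prefix ending in 'because'."""
    # Assumption: the final period is removed so the prefix reads as one clause.
    return premise.strip().rstrip(".") + " because"


prompt = copa_to_prompt("The women met for coffee.")
```

The language model is then asked to continue this prefix, and the continuation is taken as the generated cause.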

Language Model Training
We use the same language model described in Holtzman et al. (2018). We train for 10 epochs on both L2E and L2EC. We use a one-layer LSTM (Hochreiter and Schmidhuber, 1997) with 2048 hidden state dimensions and 256 word dimensions. We chose these hyperparameters by tuning on the validation set of each task. Our language model achieved a perplexity of 51.64 on the L2E test set and 37.61 on the L2EC test set.
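For readers less familiar with the metric, perplexity is the exponential of the negative mean per-token log-probability. The sketch below uses made-up log-probabilities and a hypothetical function name; it is not the evaluation code behind the reported numbers.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 1/50 to every token has perplexity 50,
# i.e. it is as uncertain as a uniform choice among 50 words.
uniform = [math.log(1 / 50)] * 10
ppl = perplexity(uniform)
```

Under this reading, a perplexity of 37.61 versus 51.64 means the model is, on average, less uncertain about each next token when context is provided.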

Seq2Seq Model Training
We experiment with two architectures: an LSTM encoder-decoder and the Transformer (Vaswani et al., 2017). We find that on the L2E task the Transformer architecture performed better, while on the L2EC task the LSTM encoder-decoder performed better. We suspect that the Transformer performs worse when the source sequence is too long. We tune each architecture's hyperparameters extensively and pick the best architecture for each task to evaluate on the evaluation datasets.

Automatic Evaluation
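We use automatic metrics to evaluate the 8 models' performance on the News Commentary dataset. Even though this is a non-overlapping held-out dataset with respect to our news training data, it is still within the same domain. We find that L2E/L2EC-based models obtained higher scores across all automatic metrics in Table 4. Our results also demonstrate that context matters for explanation. The L2EC task models, trained on context, can generate higher quality explanations than context-free L2E task models.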

Human Evaluation
Ranking Explanations We evaluate the models' relative performance on generating explanations through a survey with human evaluators. We recruited 75 participants using Amazon's Mechanical Turk (AMT). Each evaluator saw 10 prompts from a single dataset and ranked 7 to 9 explanations: the original explanation extracted from the dataset and the explanations generated by different models. Of these participants, 30 saw prompts from our Winograd dataset, 30 saw prompts from News Commentary, and 15 saw prompts from COPA. We report the results of this evaluation in the Human Ranking sub-section of Table 4.
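Because evaluators ranked between 7 and 9 explanations, ranks need a common scale before they can be averaged; the Table 4 caption states that the top rank is 0 and the lowest is 1. One way this could be computed is sketched below. The linear rescaling `(rank - 1) / (k - 1)` and all names and toy data are assumptions; the source does not spell out the exact formula.

```python
from statistics import mean
from typing import Dict, List


def normalized_rank(rank: int, n_items: int) -> float:
    """Map a 1-based rank among n_items to [0, 1]; 0 is best, 1 is worst."""
    if n_items < 2 or not 1 <= rank <= n_items:
        raise ValueError("rank must lie in 1..n_items with n_items >= 2")
    return (rank - 1) / (n_items - 1)


def average_rank(rankings: List[Dict[str, int]]) -> Dict[str, float]:
    """rankings: one dict per participant, mapping system name to rank."""
    systems = rankings[0].keys()
    return {
        s: mean(normalized_rank(r[s], len(r)) for r in rankings)
        for s in systems
    }


# Toy data for two participants ranking three systems (hypothetical).
toy = [
    {"original": 1, "model_a": 2, "model_b": 3},
    {"original": 2, "model_a": 1, "model_b": 3},
]
scores = average_rank(toy)
```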

Rating Explanations
In a follow-up survey, 60 human evaluators on AMT rated explanations generated by the L2E-Seq2Seq model with beam search and the original explanations (between participants). Ratings ranged from 0 (extremely bad) to 1 (extremely good) along various dimensions of explanation quality. Results of this study are shown in Table 5. Generated explanations were rated worse than human explanations overall, but tended to be more good than bad (≥ 0.5) on all measures.
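Table 5 reports bootstrapped 95% confidence intervals for these ratings. A percentile bootstrap over the mean is one common way to obtain such an interval; the sketch below assumes that variant, and the number of resamples, the seed, and the toy ratings are hypothetical rather than taken from the study.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(
    values: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Toy ratings on the 0-1 scale (made up for illustration).
ratings = [0.2, 0.5, 0.6, 0.7, 0.7, 0.8, 0.9, 0.4, 0.6, 0.5]
low, high = bootstrap_ci(ratings)
```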

Discussion
The nature of the phenomenon-explanation mapping has always been one-to-many. People can offer drastically different explanations for the same phenomenon. We argue that requiring the machine to generate plausible explanations is more useful and therefore a better goal for models to achieve. Models trained on traditional chatbot corpora are unable to answer why-questions because of data sparsity. We note that the generated results are not similar to the original explanations but are often acceptable by human assessment.

Features of Explanations
In the human rating experiment, our model was overall rated higher than the original explanations only on the grammaticality measure. However, this measure seems least representative of the overall explanation quality: ratings for most features were highly correlated with each other (0.771-0.865), but not with grammaticality (0.196-0.323). This shows that, while we can achieve plausible explanations with our models, more research is required in order to reach human-level quality.
Explaining as Generating Even though formulating the task of providing explanation as a sequence generation task allows us to leverage the rapid advancements in the natural language generation community, we sidestep a vast amount of literature that aims to provide informatively correct explanations as well as grounding explanations theoretically to the causal understanding of the situation (Halpern and Pearl, 2005).We also suffer from the same drawbacks noticed in natural language generation papers such as brevity and generic responses, failure to leverage long context, and being data hungry (Holtzman et al., 2018).
Exploring Linguistic Structures The curated dataset of explanation-phenomenon pairs provides an opportunity to explore descriptive structures and features of explanations. In principle, one can use this dataset to formulate frequent and common syntactic and semantic patterns for natural-sounding explanations. This would aid our understanding of how why-questions can be addressed satisfactorily.

Conclusion
We present the task of generating plausible explanations as an important goal for neural sequence-to-sequence models. We curate a large dataset of phenomenon-explanation pairs so that these models can learn to provide plausible explanations as judged by humans, and formulate responses to general Why-questions.
the NewsCrawl dataset. This dataset is not shuffled.

A.2 Training Data Curation
In order to automatically curate a sizable amount of training data, we choose large corpora made of news articles, because their sentences are well-formed and news stories contain many phenomenon-explanation pairs. We use Gigaword Fifth Edition (Parker et al., 2011), which contains news stories from seven news agencies over the span of 2001-2010. We extracted paragraphs and tokenized the sentences. We discard non-English characters. Another large dataset of news articles comes from WMT-18, the NewsCrawl dataset (Bojar et al., 2018). This dataset spans 2007-2017 and was collected from the RSS (Rich Site Summary) feeds of 18 news agencies. The only overlapping agency between Gigaword and NewsCrawl is the Los Angeles Times. In addition to the randomly shuffled dataset we obtained from the WMT-18 website, we contacted the organization for the unshuffled version of the data. We refer to this dataset as NewsCrawl-ordered. This dataset is slightly larger than the currently released version of NewsCrawl and contains a couple of months of early 2018 data. We shuffle and then split both datasets into train/valid/test with the standard 0.9/0.05/0.05 ratio. We use the validation and test sets on this task to pick the best performing model.
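The shuffle-then-split step with the 0.9/0.05/0.05 ratio can be sketched as follows. The function name, the fixed seed, and the convention that the test set takes any rounding remainder are assumptions of this example, not details reported above.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def shuffle_and_split(
    items: Sequence[T],
    ratios: Tuple[float, float, float] = (0.9, 0.05, 0.05),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle a copy of `items`, then cut it into train/valid/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(ratios[0] * len(shuffled))
    n_valid = round(ratios[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # takes any rounding remainder
    return train, valid, test


train, valid, test = shuffle_and_split(range(1000))
```

Shuffling before splitting matters for the unshuffled NewsCrawl-ordered data, since a chronological cut would otherwise put the most recent stories only in the held-out sets.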

A.3 Language Model Details
We use adaptive gradient descent (AdaGrad) with learning rate 0.1 and weight decay of 1e-6.
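For orientation, one AdaGrad update with the stated learning rate and weight decay can be written out as below. This is a sketch of the textbook rule rather than the training code: treating weight decay as an L2 term added to the gradient and the value of `eps` are assumptions, and the function name is hypothetical.

```python
import math
from typing import List


def adagrad_step(
    params: List[float],
    grads: List[float],
    accum: List[float],
    lr: float = 0.1,
    weight_decay: float = 1e-6,
    eps: float = 1e-10,
) -> None:
    """Update `params` and the squared-gradient accumulator in place."""
    for i, (p, g) in enumerate(zip(params, grads)):
        g = g + weight_decay * p   # L2-style weight decay (assumption)
        accum[i] += g * g          # running sum of squared gradients
        params[i] = p - lr * g / (math.sqrt(accum[i]) + eps)


# On the first step the accumulator equals g^2, so each parameter moves
# by roughly lr in the direction opposite to its gradient.
w = [1.0]
state = [0.0]
adagrad_step(w, grads=[2.0], accum=state)
```

Because the accumulator only grows, the effective per-parameter step size shrinks over training, which is the adaptive behavior the method is named for.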

A.4 Seq2Seq Model Details
We built and trained our Seq2Seq model using OpenNMT (Klein et al., 2017). For the L2E task, we used a 6-layer Transformer model with hidden dimension 512, feedforward layer dimension 2048, and 8 attention heads. We train with a dropout rate of 0.1 and the Noam optimizer. For the L2EC task, we used a 2-layer LSTM model with a hidden dimension size of 650 for both the encoder and decoder, as well as for the word embedding. We train with a dropout rate of 0.2 and the Noam optimizer.
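"Noam" usually refers to the learning-rate schedule introduced with the Transformer (Vaswani et al., 2017): a linear warm-up followed by inverse-square-root decay, scaled by the model dimension. The sketch below shows that schedule with the hidden dimension of 512 given above; the warm-up length of 4000 steps and the scale `factor` are not reported in this appendix and are assumptions here.

```python
def noam_lr(
    step: int,
    d_model: int = 512,        # hidden dimension reported for the L2E model
    warmup_steps: int = 4000,  # assumption: not reported in the text
    factor: float = 1.0,       # assumption: not reported in the text
) -> float:
    """Learning rate at a 1-based training step under the Noam schedule."""
    if step < 1:
        raise ValueError("step must be >= 1")
    # Linear warm-up until `warmup_steps`, then decay proportional to
    # 1 / sqrt(step); both are scaled by d_model ** -0.5.
    return factor * d_model ** -0.5 * min(
        step ** -0.5, step * warmup_steps ** -1.5
    )


peak = noam_lr(4000)
```

The rate peaks exactly at `warmup_steps`, so that value controls both how cautious the early updates are and how large the maximum learning rate becomes.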

Figure 1: We show the original Winograd schema sentence, the original offered explanation, and generated responses from our models.

Figure: Correlations of human ratings on Winograd Schema Challenge explanations.

Figure S2: The average ranking of each model's generated response (lower is better).

Table 1: Top are training datasets and bottom are evaluation datasets for each task. We report the average length of sentences for each dataset (S1 and S2 combined). News Commentary with context has 156.3 words on average.

Table 2: Example pairs from our highest performing models with the original sentence as a reference. For the human ranking score, lower is better. We provide examples of especially poorly rated generations in the Appendix.

Table 3: We report the best per-token accuracy and perplexity evaluated for each tuned architecture on the L2E/L2EC validation dataset.

Table 4: BLEU, ROUGE, and METEOR are evaluated on News Commentary test data. Any model with C in the name is evaluated with full context. Models with † are pre-trained models from other work. Only L2E-Seq2Seq uses the Transformer architecture; the rest use LSTMs. In human ranking, we report the average rank across participants. Top ranking is 0 and lowest ranking is 1.

Table 5: Results of rating study with human evaluators, average rating and bootstrapped 95% CI.