Open-Domain Why-Question Answering with Adversarial Learning to Encode Answer Texts

In this paper, we propose a method for why-question answering (why-QA) that uses an adversarial learning framework. Existing why-QA methods retrieve “answer passages” that usually consist of several sentences. These multi-sentence passages contain not only the reason sought by a why-question and its connection to the why-question, but also redundant and/or unrelated parts. We use our proposed “Adversarial networks for Generating compact-answer Representation” (AGR) to generate from a passage a vector representation of the non-redundant reason sought by a why-question and exploit the representation for judging whether the passage actually answers the why-question. Through a series of experiments using Japanese why-QA datasets, we show that these representations improve the performance of our why-QA neural model as well as that of a BERT-based why-QA model. We show that they also improve a state-of-the-art distantly supervised open-domain QA (DS-QA) method on publicly available English datasets, even though the target task is not a why-QA.


Introduction
Why-question answering (why-QA) tasks retrieve from a text archive answers to such why-questions as "Why does honey last such a long time?" Previous why-QA methods retrieve from a text archive answer passages, each of which consists of several sentences, like A in Table 1 (Girju, 2003;Higashinaka and Isozaki, 2008;Oh et al., , 2013Sharp et al., 2016;Verberne et al., 2011), and then determine whether the passages answer the question. A proper answer passage must contain (1) a paraphrase of the why-question (e.g., the underlined texts in Table 1) and (2) the reasons or the causes (e.g., the bold texts in Table 1) of Q Why does honey last a long time? A While excavating Egypt's pyramids, archaeologists have found pots of honey in an ancient tomb: thousands of years old and still preserved. Honey can last a long time due to three special properties. Its average pH is 3.9, which is quite acidic. Such high level of acidity is certainly hostile and hinders the growth of many microbes. Though honey contains around 17-18% water, its water activity is too low to support the growth of microbes. Moreover honey contains hydrogen peroxide, which is thought to help prevent the growth of microbes in honey. Despite these properties, honey can be contaminated under certain circumstances. C Because its acidity, low water activity, and hydrogen peroxide together hinder the growth of microbes. Table 1: Answer passage A to why-question Q and its compact answer C the events described in the why-question, both of which are often written in multiple non-adjacent sentences. This multi-sentenceness implies that the answer passages often contain redundant parts that are not directly related to a why-question or its reason/cause and whose presence complicates the why-QA task. Highly accurate why-QA methods should be able to find the exact reason sought by a why-question in an answer passage without being distracted by redundancy.
In this paper, we train a neural network (NN) to generate, from an answer passage, a vector representation of the non-redundant reason asked by a why-question, and exploit the generated vector representation as evidence for judging whether the passage answers the why-question. This idea was inspired by Ishida et al. (2018), who used a seq2seq model to automatically generate such compact answers as C in Table 1 from the answer passages retrieved by a why-QA method. Compact answers are sentences or phrases that express the reasons for a given why-question without redundancy. If we can use such automatically generated compact-answers to support a why-QA method in finding the exact reason of a whyquestion in these passages, why-QA accuracy may be improved. We actually tried this idea in a preliminary study in which we generated a compact answer from a given question-passage pair by using the compact-answer generation method of Iida et al. (2019) and used the generated compactanswer along with the given question-passage pair to find proper answer passages. However, we were disappointed by the small performance improvement, as shown in our experimental results.
We chose an alternative approach. Instead of generating a compact answer of an answer passage as word sequences, we devised a model to generate a compact-answer representation, which is a vector representation for a compact answer, from an answer passage. Inspired by the generative adversarial network (GAN) approach (Goodfellow et al., 2014), we developed an adversarial network called the Adversarial networks for Generating compact-answer Representation (AGR). Like the original GAN, an AGR is composed of a generator and a discriminator: the generator network is trained for generating (from answer passages) fake representations to make it hard for the discriminator network to distinguish these fake representations from the true representations derived from manually created compact-answers.
We combined the generator network in the AGR with an extension of the state-of-the-art why-QA method . Our evaluation against a Japanese open-domain why-QA dataset, which was created using general web texts as a source of answer passages, revealed that the generator network significantly improved the accuracy of the top-ranked answer passages and that the combination significantly outperformed several strong baselines, including a combination of a generator network and a BERT model (Devlin et al., 2019). This combination also outperformed a vanilla BERT model, suggesting that the generator network in our AGR may be effective even if it is combined with many types of NN architectures. Another interesting point is that the performance improved even when we replaced, as the inputs to AGR, the word embedding vectors that represent an answer passage, with a random vector. This observation warrants further exploration in our future work.
Finally, we applied our AGR to a distantly su-  (Chen et al., 2017), which is an extension of a machinereading task, to check whether it is applicable to other datasets. We combined our generator network with a state-of-the-art DS-QA method, OpenQA (Lin et al., 2018), and used a generated compact-answer representation from a given passage as evidence to 1) select relevant passages from the retrieved ones and 2) find an answer from the selected passages. Although the task was not our initial target (why-QA) and the answers in the DS-QA task were considerably shorter than those in the why-QA, experiments using three publicly available datasets (Quasar-T (Dhingra et al., 2017), SearchQA (Dunn et al., 2017), and Triv-iaQA (Joshi et al., 2017)) revealed that the generator network improved the performance in most cases. This suggests that AGR may be applicable to many QA-like tasks.
2 Why-QA Model Figure 1 illustrates the architecture of our why-QA model and the AGR. Our why-QA model computes the probability that a given answer passage describes a proper answer to a given why-question using the representations of a question, an answer passage, and a compact answer. The probability (the why-QA model's final output) is computed from these representations by our answer selection module, which is a logistic regression layer with dropout and softmax output.
The representations of why-questions and answer passages are generated by Convolutional Neural Networks (CNNs) (Collobert et al., 2011;LeCun et al., 1998) that (1) are augmented by two types of attention mechanisms, similarityattention  and causality-attention , and (2) are given two types of word embeddings, general word embeddings computed by word2vec (Mikolov et al., 2013) using Wikipedia and causal word embeddings (Sharp et al., 2016). Note that in computing a question's representation, the answer passage is given to the question encoder to guide the computation. Likewise the passage encoder is given the question and the representation of the compact answer. We represent these information flows with dotted arrows in Fig. 1(a).
The representations of compact answers are created by a generator network called a fakerepresentation generator (F in Fig. 1(a)), which is pre-trained in an adversarial learning manner ( Fig. 1(b)). During the training of the whole why-QA model, the generator's parameters are fixed and no further fine-tuning is conducted.
In the next section, we describe our main contribution: the AGR and the fake-representation generator. The entire why-QA model can be seen as an extension of the state-of-the-art why-QA method . Its details are described in Section A of the supplementary materials.

Adversarial Networks for Generating
Compact-answer Representation

Adversarial training
Generative adversarial networks (GANs) (Goodfellow et al., 2014) are a framework for training generative models based on game theory. Unlike the original GANs, which generate such data samples as images and compact answers from noise, our AGR generates useful vector representations from meaningful text passages. To clarify the difference, we explain our AGR with three subnetworks: two generators, F and R, and a discriminator, D, as in Fig. 1(b). Generator F takes as input passage p drawn from prior passage distribution d p and outputs vectorp as a fake representation of a compact answer. We call F a fake-representation generator. R, which we call a real-representation generator, is given sample c taken from manually created compact-answers and provides vector c as a real-representation of the sampled compactanswer. Discriminator D has to distinguish fakerepresentationp from real-representationc of a compact answer. These three networks play an adversarial minimax game; fake-representation generator F creates a fake compact-answer representation that is hard for the discriminator to distinguish from the representations of manually created compact-answers, and discriminator D and generator R simultaneously try to avoid being duped by generator F . These processes should allow generator F to learn how to generate a representation of a proper compact-answer from an answer passage. In addition, since passage p and compact answer c are dependent on question q, the generation of the compact-answer representations by F and R is conditioned by question q, like in the conditional GANs (Mirza and Osindero, 2014). We trained our AGR with the following minimax objective:

Generator and discriminator
In our implementation, both F and R are networks with identical structure called Encoder. They are defined as follows, where p, c, and q are respectively an answer passage, a manually created compact-answer, and a why-question: Here θ F and θ R represent the parameters of networks F and R. The details of Encoder are described below.
Discriminator D(r) takes as input r, either the output of F (p|q) or that of R(c|q), and computes the probability that given representation r comes from a real compact-answer using a feedforward network with two hidden layers (100 nodes in the first layer and 50 in the second layer) and a logistic regression layer on top of the hidden layers. We used sigmoid outputs by the logistic regression layer as the output probability.  Figure 2 illustrates the architecture shared by our fake-representation generator F and real-representation generator R, namely, Encoder(t; θ, q), where θ is a set of parameters, q is a why-question, and t is either an answer passage or a manually created compact-answer. Encoder(t; θ, q) first represents question q and passage/compact-answer t with pre-trained word embeddings, which are supplemented with attention mechanisms.

Encoder
The resulting attention-weighted word embeddings are given to convolutional neural networks (CNNs) that generate a single feature vector, which is an output/value of Encoder(t; θ, q).
In the following, we give an overview of the word embeddings, the attention mechanisms, and the CNNs used in Encoder(t; θ, q). All of these techniques were proposed by previous works. Further details are given in Section B of the supplementary materials.

Word embeddings
The pre-trained word embeddings used in Encoder(t; θ, q) were obtained by concatenating two types of d-dimensional word embeddings (d = 300 in this work): general word embeddings and causal word embeddings.
General word embeddings are widely used embedding vectors (300 dimensions) that were pretrained for about 1.65 million words by applying word2vec (Mikolov et al., 2013) to about 35 million sentences from Japanese Wikipedia (January 2015 version).
Causal word embeddings (Sharp et al., 2016) were proposed for representing the causal associations between words. Sharp et al. (2016) cre-ated a set of cause-effect word pairs by paring each content word in a cause part with each content word in an effect part of the same causality expression, such as "Volcanoes erupt because magma pushes through vents and fissures." In this work, we extracted 100 million causality expressions from 4-billion Japanese web pages using the causality recognizer of Oh et al. (2013). Then, following Sharp et al. (2016), we trained 300dimensional causal word embeddings for about 1.85 million words by applying the generalized skip-gram embedding model of Levy and Goldberg (2014) to the causality expressions.

Attention
We also applied two types of attention mechanisms to the above word embeddings. The first type of attention, similarity-attention , was used for estimating the similarities between words in question q and those in passage/compact-answers t and focusing on the attended words as those that directly indicate the connection between the question and passage/compact-answers. Basically, the mechanism computes the cosine similarity between the embeddings of the words in q and t, and uses it for producing attention feature vector a s j ∈ R for word t j in passage/compact-answers.
Another attention mechanism, causalityattention , was proposed for focusing on passage words causally associated with question words. They used normalized point-wise mutual information to measure the strength of the causal associations with the causality expressions used for creating the causal embeddings. The scores are used for producing causality-attention feature vector a c j for word t j . Finally, we form two attention feature vectors, a s = [a s 1 , · · · , a s |t| ] and a c = [a c 1 , · · · , a c |t| ], concatenate them into a = [a s ; a c ] ∈ R 2×|t| , and produce attention-weighted word embedding t att of given text t, which is either an answer passage or a compact answer: where W t ∈ R 2d×2d and W a ∈ R 2d×2 are trainable parameters, t is the representation of text t, and ReLU represents the rectified linear units.

3.3.3
CNNs t att is given to CNNs to generate final representation r t of a given passage/compact-answer t.
The CNNs resembles those in Kim (2014). Convolutions are performed over the word embeddings using both multiple filters and multiple filter windows (e.g., sliding over 1, 2, or 3 word windows at a time and 100 filters for each window). An average pooling operation is applied to the convolution results to generate representation r t , which is the output/value of Encoder(t; θ, q); r t = Encoder(t; θ, q). In our experiments, we set the dimension of representation r t to 300.

Datasets
We used three datasets, W hySet, CmpAns, and AddT r, for our why-QA experiments. W hySet and AddT r were used for training and evaluating the why-QA models, while CmpAns was used for training AGR.
The W hySet dataset, which was used in previous works for why-QA (Oh et al., , 2013, is composed of 850 Japanese why-questions and their top-20 answer passages (17,000 question-passage pairs) obtained from 600 million Japanese web pages using the answerretrieval method of Murata et al. (2007), where a question-passage pair is composed of a singlesentence question and a five-sentence passage. The label of each question-answer pair (i.e., correct answer and incorrect answer) was manually annotated (See Oh et al. (2013) for more details). Oh et al. (2013) selected 10,000 questionpassage pairs as training and test data in 10-fold cross-validation (9,000 for training and 1,000 for testing) and used the remainder (7,000 questionpassage pairs) as additional training data during the 10-fold cross-validation. We followed the settings and, in each fold, we selected 1,000 pairs from the 9,000 pairs for training to use as development data for tuning hyperparameters. Note that there are no shared questions in the training, development, or test data.
For training the AGR, we used CmpAns, the training data set created in Ishida et al. (2018) for compact-answer generation; CmpAns consists of 15,130 triples of a why-question, an answer passage, and a manually-created compact answer. These cover 2,060 unique why-questions. Note that there was no overlap between the questions in CmpAns and those in W hySet. CmpAns was created in the following manner: 1) human annotators manually came up with open-domain why-questions, 2) retrieved the top-20 passages for each why-question using the open-domain why-QA module of a publicly available web-based QA system WISDOM X (Mizuno et al., 2016;, and 3) three annotators created (when possible) a compact answer for each of the retrieved passages. The passages for which no annotator could create a compact answer were discarded, and were not included in the 15,130 triples mentioned previously. The average lengths of questions, passages, and compact answers in CmpAns were 10.5 words, 184.4 words, and 8.3 words, respectively.
Finally, we created additional training data AddT r for training the why-QA models. If an annotator could write a compact answer for a question and an answer passage, she/he probably recognized the passage as a proper answer passage to the question. Based on this observation, we built AddT r from CmpAns by applying a majority vote. We only gave a correct answer label to a question and a passage if at least two of the three annotators wrote compact answers, and it received an incorrect answer label otherwise. AddT r has 10,401 pairs in total. We used AddT r as additional training data for baselines that lack a mechanism for generating compact-answer representations, for a fair comparison with other methods that use CmpAns for such mechanisms.
We processed all the data with MeCab 1 , a morphological analyzer, to segment the words.

Training details
In our proposed methods and their variants, all the weights in the CNNs were initialized using He's method (He et al., 2015), and the other weights in our why-QA model were initialized randomly with a uniform distribution in the range of (-0.01, 0.01). For the CNN-based components, we set the window size of the filters to "1,2,3" with 100 filters each 2 . We used dropout (Srivastava et al., 2014) with probability 0.5 on the final logistic regression layer. All of these hyper-parameters were chosen with our development data. We optimized the learned parameters with the Adam stochastic gradient descent (Kingma and Ba, 2015). The learning rate was set to 0.001, and the batch size for each iteration was set to 20.

Compared methods
We tried three schemes for training our AGR in our proposed method. In the first scheme, pairs of passages and compact answers in CmpAns were given to fake-representation generator F and realrepresentation generator R as their inputs. We called the fake-representation generator trained in this way F OP and referred to our proposed method using F OP as Ours(OP). In the second scheme, we randomly sampled five-sentence passages that contain some clue words indicating the existence of causal relations, such as "because," from 4-billion web pages and fed them to fakerepresentation generator F . We fed the same number of the sampled passages as in CmpAns for fair comparison. We refer to the method trained by this scheme as Ours(RP). In the final scheme, we replaced the word embeddings for the passages given to fake-representation generator F with random vectors and used similarity-attention but not causality-attention. The fake-representation generator trained in this way is called F RV , and our proposed method using F RV is called Ours(RV). This scheme is more similar to the original GAN than the others because the fake-representation generator is given random noises.
We implemented and evaluated the following four why-QA models in previous works as baselines, using the same dataset as ours: We also evaluated nine baseline neural models, four of which are BERT-based models (BERT, BERT+AddTr, BERT+F OP , and BERT+F RV ), to show the effectiveness of our why-QA model and AGR. They are listed in Table 2.

BASE
Proposed method from which we removed fake-representation generator F .

BASE+AddTr BASE that used both W hySet and
AddT r as its training data.

BASE+CAns
On top of BASE, it additionally used real-representation generator R to encode compact answers, which were generated by the compactanswer generator of Iida et al. (2019). R was trained alongside the why-QA model using W hySet and the compact-answer generator was pre-trained with CmpAns.

BASE+CEnc
On top of BASE, it additionally used the encoder in the compact-answer generator of Iida et al. (2019) to create compact-answer representation.
The encoder was pre-trained with CmpAns.

BASE+Enc
Same as Ours(OP) except that the fake-representation generator was trained in a supervised manner alongside the why-QA model using W hySet and AddT r as the training data.
BERT Same as BASE except that the CNNbased encoders for questions and passages were replaced with the BERT (Devlin et al., 2019).
BERT+AddTr BERT, which used both W hySet and AddT r as its training data.

BERT+FOP
On top of BERT, it additionally used compact-answer representation produced by FOP for answer selection.

BERT+FRV
Same as BERT+FOP except that it used FRV instead of FOP for producing compact-answer representation. To pre-train the BERT-based models, we used a combination of sentences extracted from Japanese Wikipedia articles (August 2018 version) and causality expressions automatically recognized from a causality recognizer (Oh et al., 2013). This data mix consists of 75% of sentences extracted from Wikipedia (14,675,535 sentences taken out of 784,869 articles randomly sampled) and 25% of cause and effect phrases taken from causality expressions (4,891,846 phrases from 2,445,923 causal relations). This ratio was determined through preliminary experiments using the development data. For the pre-training parameters, we followed the settings of BERT BASE in Devlin et al.  Figure 3: Architecture of BERT: E represents input embedding and T i represents contextual representation of token i.
[CLS] is a special symbol added in front of every input example, and [SEP] is a special separator token (e.g. for separating questions/passages).
(2019) 3 except for the batch size of 50. We ran 3 epochs with the learning rate of 1e-5 for finetuning the BERT-based models 4 . A BERT-based model, BERT, takes a questionpassage pair as input and computes the input representation using token, segment, position, and attention feature embeddings (Fig. 3). For the input representation computation, the original BERT only used the token, segment, and position embeddings, while BERT additionally used the attention feature embeddings 5 to exploit the same similarity-attention and causalityattention features used in our proposed method. We used the attention feature embeddings during the fine-tuning and testing, but not during the pretraining of the BERT-based model. The attention feature embeddings for answer passages (i.e., E sim w 1 , · · · , E sim w M , and E caus w 1 , · · · , E caus w M ) were computed from the same attention feature vectors, a s and a c , as those in our proposed methods; those for the other parts (i.e., questions, [CLS], and [SEP]) were computed from a zero vector (indicating no attention feature). The transformer encoder processed the input representation to gen-  Table 3: Why-QA performances erate a representation for a question-passage pair and passed the generated representation to the logistic regression layer for answer selection. BERT was trained with the training data in W hySet. BERT+AddTr is the same as BERT except that it additionally used AddT r as training data. On top of BERT, BERT+F OP and BERT+F RV additionally used the compact-answer representation produced by our fake-representation generator for answer selection by giving it to the final logistic regression layer. Table 3 shows the performances of all the methods in the Precision of the top answer (P@1) and the Mean Average Precision (MAP) (Oh et al., 2013). Note that the Oracle method indicates the performance of a fictional method that ranks the answer passages perfectly, i.e., it locates all the m correct answers to a question in the top-m ranks, based on the gold-standard labels. This performance is the upper bound of those of all the implementable methods. Our proposed method, Ours(OP), outperformed all the other methods. Our starting point, i.e., BASE, was already superior to the methods in the previous works. Compared with BASE and BASE+AddTr, neither of which used compactanswer representations or fake-representation generator F , Ours(OP) gave 3.4% and 2.8% improvement in P@1, respectively. It also outperformed BASE+CAns and BASE+CEnc, which generated compact-answer representations in a way different from the proposed method, and BASE+Enc, which trained the fake-representation generator without adversarial learning. These performance differences were statistically significant (p < 0.01 by the McNemar's test).

Results
Ours(OP) also outperformed all the BERTbased models but an interesting point is that fakerepresentation generator F boosted the performance of the BERT-based models (statistically significant with p < 0.01 by the McNemar's test). These results suggest that AGR is effective in both our why-QA model and our BERT-based model.

Deeper analysis on the output of F
Another interesting point is that Ours(RV), in which fake-representation generator F RV was trained using random vectors, achieved almost the same performance as that of Ours(OP). This result was puzzling, so we first checked whether F RV 's output was not just random noise (which could prevent the why-QA model from overfitting) by replacing in Ours(RV) the output of F RV by random vectors. Although we sampled the random vectors from different distribution types with various ranges, we obtained at best similar performance to that of BASE: 51.6 in P@1. This result confirms that it is not trivial to mimic F RV using random vectors at least.
We investigated the F RV 's output to check whether it actually focused on the compact answer in a given passage. We computed the following three representation sets from a gold set of 3,608 triples of why-questions, answer passages and manually created compact-answers that do not overlap with CmpAns: • {r org i }: F RV 's output with the pairs of a whyquestion and an answer passage in the gold set as its input; • {r in i }: F RV 's output for the same input as {r org i }, where we replaced the word embeddings of all the content words in the answer passages that also appeared in the associated gold compact-answers with random vectors; • {r out i }: F RV 's output for the same input as {r org i }, where we replaced the word embeddings of all the content words in the answer passages that did not appear in the associated gold compact-answers with random vectors 6 .
If F RV perfectly focuses on the gold standard compact-answers, for each question-passage pair, 6 For both r in i and r out i , we never replaced the word embeddings for the words that also appeared in the question.  focused equally on every answer passage word. However, the actual results suggest that this is not the case. Although we cannot draw decisive conclusions due to the complex nature of neural networks, we believe from the results that F RV does actually focus more on words that are a part of a compact answer than on other words. We also computed {r org i }, {r in i }, and {r out i } with fakerepresentation generator F OP in the same way and observed the same tendency.

DS-QA Experiments
We tested our framework on another task, the distantly supervised open-domain question answering (DS-QA) task (Chen et al., 2017), to check its generalizability. Table 4 shows the statistics for the datasets used in this experiment. The first three, Quasar-T, SearchQA, and TriviaQA provided by Lin et al. (2018), were used for training and evaluating DS-QA methods. The training data of SQuAD v1.1 (Rajpurkar et al., 2016) was used for training our AGR. The SQuAD dataset consisted of the triples of a question, an answer, and a paragraph that includes the answer. We assume that the answers are our compact answers, although the answers in the dataset are consecutive short word sequences (2.8 words on average), whose majority are noun phrases, unlike the compact answers for our why-QA experiment, i.e., sentences or phrases (8.3 words on average).
We trained our AGR with all the triples of a question, an answer, and a paragraph in the training data of SQuAD-v1.1 under the same settings for the AGR's hyperparameters as in our why-QA experiment except that we use neither causal word embeddings nor causality-attention.
In this experiment, we used the AGR training schemes for Ours(OP) and Ours(RV). We used the 300-dimensional GloVe word embeddings learned from 840 billion tokens in the web crawl data (Pennington et al., 2014), as general word embeddings. Then we combined the resulting fake-representation generator F in the AGR with the state-of-the-art DS-QA method, OpenQA (Lin et al., 2018) 7 . We also used the hyperparameters presented in Lin et al. (2018).
OpenQA is composed of two components: a paragraph selector to choose relevant paragraphs (or answer passages in our terms) from a set of paragraphs and a paragraph reader to extract answers from the selected paragraphs. For identifying answer a to given question q from set of paragraphs P = {p i }, the paragraph selector and the paragraph reader respectively compute probabilities P r(p i |q, P ) and P r(a|q, p i ), and final output P r(a|q, P ) is obtained by combining the probabilities. We introduced c i , which is a compact-answer representation generated by fakerepresentation generator F with question q and paragraph p i as its input, to the computation of the probabilities as follows: P r(a|q, P, C) = i P r(a|q, p i , c i )P r(p i |q, P, c i ) In the original OpenQA, the paragraph selector and the reader use bidirectional stacked RNNs for encoding paragraphs, where word embeddings p i of a paragraph is used as the input. In our implementation, we computed attention-weighted embeddingp i of a paragraph by using compactanswer representation c i . Given word embedding p j i for the j-th word in paragraph p i , its attentionweighted embeddingp j i was computed by using a bilinear function (Sutskever et al., 2009): is a trainable matrix, softmax j (x) denotes the j-th element of the softmaxed vector of x, and d = 300. We gave [p j i ;p j i ], a concatenation of p j i andp j i , as the word embedding of the j-th word in paragraph p i to the bidirectional stacked RNNs.  Table 5: DS-QA performances: Result of TriviaQA dataset is on its development data. § and † respectively indicate statistical significance of difference in performance between Ours(·) and OpenQA by the McNemar's test with p < 0.05 and p < 0.01 Table 5 shows the performances of the four DS-QA methods: R 3 (Wang et al., 2018), OpenQA (Lin et al., 2018), Ours(OP), and Ours(RV) evaluated against the Quasar-T, SearchQA and TriviaQA datasets. All the methods were evaluated with EM and F1 scores, following Lin et al. (2018). EM measures the percentage of predictions that exactly match one of the ground-truth answers and F1 is a metric that loosely measures the average overlap between the prediction and ground-truth answer. Note that both Ours(OP) and Ours(RV) outperformed both previous methods, R 3 and OpenQA, except for the F1 score for the TriviaQA dataset. Some of the improvements over the previous state-ofthe-art method, OpenQA, were statistically significant. These findings suggest that our framework can be effective for tasks other than the original why-QA and the other datasets.

Conclusion and Future Work
We proposed a method for why-question answering (why-QA) that used an adversarial learning framework. It employed adversarial learning to generate vector representations of reasons or true answers from answer passages and exploited the representations for judging whether the passages are proper answer passages to the given whyquestions. Through experiments using Japanese why-QA datasets, we showed that this idea improved why-QA performance. We also showed that our method improved the performance in a distantly supervised open-domain QA task.
In our why-QA method, causality expressions extracted from the web were used as background knowledge for computing causality-attention/embeddings. As a future work, we plan to introduce a wider range of background knowledge including another type of event causality (Hashimoto et al., , 2014(Hashimoto et al., , 2015.