CNN for Text-Based Multiple Choice Question Answering

The task of Question Answering is at the very core of machine comprehension. In this paper, we propose a Convolutional Neural Network (CNN) model for text-based multiple choice question answering where questions are based on a particular article. Given an article and a multiple choice question, our model assigns a score to each question-option tuple and chooses the final option accordingly. We test our model on the Textbook Question Answering (TQA) and SciQ datasets. Our model outperforms several LSTM-based baseline models on the two datasets.


Introduction
Answering questions based on a particular text requires a diverse skill set. It requires look-up ability, the ability to deduce, the ability to perform simple mathematical operations (e.g. to answer questions like "how many times did the following word occur?"), and the ability to merge information contained in multiple sentences. This diverse skill set makes question answering a challenging task.
Question Answering (QA) has seen a great surge of more challenging datasets and novel architectures in recent times. A question answering task may require the system to reason over a few sentences (Weston et al., 2015), a table (Pasupat and Liang, 2015), a Wikipedia passage (Rajpurkar et al., 2016; Yang et al., 2015), or a lesson (Kembhavi et al., 2017). The increase in the size of the datasets has allowed researchers to explore different neural network architectures (Chen et al., 2016; Cui et al., 2016; Xiong et al., 2016; Trischler et al., 2016) for this task. Given a question based on a text, the model needs to attend to a specific portion of the text in order to answer the question. Hence, the use of an attention mechanism (Bahdanau et al., 2014) is common in these architectures.
Convolutional Neural Networks (CNN) have been shown to be effective for various natural language processing tasks such as sentiment analysis and question classification (Kim, 2014). However, for the task of question answering, Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) based methods are the most common. In this paper we build a CNN-based model for multiple choice question answering. We show the effectiveness of the proposed model by comparing it with several LSTM-based baselines.
The main contributions of this paper are:
(i) The proposed CNN model performs comparably to or better than LSTM-based baselines on two different datasets (Kembhavi et al., 2017; Welbl et al., 2017).
(ii) Our model takes a question-option tuple as input to generate a score for the concerned option. We argue that this is a better strategy than considering questions and options separately for multiple choice question answering. For example, consider the question "The color of the ball is" with three options: red, green and yellow. If the model generates a vector which is to be compared with the three option embeddings, then this might lead to errors since the three option embeddings are close to each other.
(iii) We have devised a simple but effective strategy to deal with questions having options like none of the above, two of the above, all of the above, both (a) and (b) etc., which was not done before.
(iv) Instead of attending on words present in the text, our model attends at the sentence level. This helps the model answer look-up questions, since the information required to answer such questions will often be contained in a single sentence.

Method
Given a question based on an article, usually only a small portion of the article is needed to answer the question. Hence it is not fruitful to give the entire article as input to the neural network. To select the most relevant paragraph in the article, we take both the question and the options into consideration instead of just the question. The rationale behind this approach is to get the most relevant paragraph in cases where the question is very general in nature. For example, consider that the article is about the topic carbon and the question is "Which of the following statements is true about carbon?". In such a scenario, it is not possible to choose the most relevant paragraph by just looking at the question. We select the most relevant paragraph by word2vec based query expansion (Kuzi et al., 2016) followed by tf-idf score (Foundation, 2011).
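The effect of adding the options to the query can be illustrated with a minimal sketch. It is not the authors' retrieval code: the word2vec query expansion step is omitted, the tf-idf weighting (length-normalized term frequency with a smoothed idf) is an assumed variant, and the names `tokenize` and `select_paragraph` as well as the toy paragraphs are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_paragraph(paragraphs, question, options):
    """Return the index of the paragraph with the highest tf-idf score
    for a query built from the question AND all of its options."""
    docs = [Counter(tokenize(p)) for p in paragraphs]
    n_docs = len(docs)
    # Document frequency of every term across the paragraphs of the article.
    df = Counter(term for doc in docs for term in doc)
    query = set(tokenize(question + " " + " ".join(options)))

    def score(doc):
        length = sum(doc.values()) or 1
        total = 0.0
        for term in query:
            if term in doc:
                tf = doc[term] / length
                # Smoothed idf (an assumed variant, not taken from the paper).
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                total += tf * idf
        return total

    scores = [score(doc) for doc in docs]
    return max(range(n_docs), key=scores.__getitem__)


paragraphs = [
    "Carbon is a nonmetal found in group 14 of the periodic table.",
    "Diamond and graphite are two crystalline forms of carbon.",
    "Carbon forms four covalent bonds with other atoms.",
]
question = "Which of the following statements is true about carbon?"
options = ["It forms four covalent bonds", "It is a noble gas", "It is a metal"]

with_options = select_paragraph(paragraphs, question, options)
question_only = select_paragraph(paragraphs, question, [])
```

In this toy article the question alone matches mostly on function words and selects the first paragraph, while the query that includes the options selects the third paragraph, which is the one that mentions covalent bonds.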

Neural Network Architecture
We use word embeddings (Mikolov et al., 2013) to encode the words present in the question, the options and the most relevant paragraph. As a result, each word is assigned a fixed $d$-dimensional representation. The proposed model architecture is shown in Figure 1. Let $q$ and $o_i$ denote the word embeddings of the words present in the question and the $i$-th option respectively. Thus, $q \in \mathbb{R}^{d \times l_q}$ and $o_i \in \mathbb{R}^{d \times l_o}$, where $l_q$ and $l_o$ represent the number of words in the question and the option respectively. The question-option tuple $(q, o_i)$ is embedded using a Convolutional Neural Network (CNN) with a convolution layer followed by average pooling. The convolution layer has three types of filters of sizes $f_j \times d$, $j = 1, 2, 3$, each with $k$ output channels. Each filter type $j$ produces a feature map of shape $(l_q + l_o - f_j + 1) \times k$, which is average pooled to generate a $k$-dimensional vector. The three $k$-dimensional vectors are concatenated to form a $3k$-dimensional vector. Note that Kim (2014) used max pooling, but we use average pooling to ensure different embeddings for different question-option tuples. Hence,

$$h_i = \mathrm{CNN}([q; o_i]) \quad \forall i = 1, 2, \ldots, n_q \tag{1}$$

where $n_q$ is the number of options, $h_i$ is the output of the CNN and $[q; o_i]$ denotes the concatenation of $q$ and $o_i$, i.e. $[q; o_i] \in \mathbb{R}^{d \times (l_q + l_o)}$. The sentences in the most relevant paragraph are embedded using the same CNN. Let $s_j$ denote the word embeddings of the words present in the $j$-th sentence, i.e. $s_j \in \mathbb{R}^{d \times l_s}$, where $l_s$ is the number of words in the sentence. Then,

$$d_j = \mathrm{CNN}(s_j) \quad \forall j = 1, 2, \ldots, n_{sents} \tag{2}$$

where $n_{sents}$ is the number of sentences in the most relevant paragraph and $d_j$ is the output of the CNN. The rationale behind using the same CNN for embedding the question-option tuple and the sentences in the most relevant paragraph is to ensure similar embeddings for similar question-option tuples and sentences. Next, we use $h_i$ to attend on the sentence embeddings. Formally,

$$a_{ij} = \frac{h_i \cdot d_j}{\|h_i\| \, \|d_j\|} \tag{3}$$

$$r_{ij} = \frac{\exp(a_{ij})}{\sum_{j'=1}^{n_{sents}} \exp(a_{ij'})} \tag{4}$$

$$m_i = \sum_{j=1}^{n_{sents}} r_{ij} \, d_j \tag{5}$$

where $\|\cdot\|$ signifies the $\ell_2$ norm, $\exp(x) = e^x$ and $h_i \cdot d_j$ is the dot product between the two vectors. Since $a_{ij}$ is the cosine similarity between $h_i$ and $d_j$, the attention weights $r_{ij}$ give more weight to those sentences which are more relevant to the question. The attended vector $m_i$ can be thought of as the evidence in favor of the $i$-th option. Hence, to give a score to the $i$-th option, we take the cosine similarity between $h_i$ and $m_i$, i.e.

$$score_i = \frac{h_i \cdot m_i}{\|h_i\| \, \|m_i\|} \tag{6}$$

Finally, the scores are normalized using softmax to get the final probability distribution:

$$p_i = \frac{\exp(score_i)}{\sum_{i'=1}^{n_q} \exp(score_{i'})} \tag{7}$$

where $p_i$ denotes the probability for the $i$-th option.
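The attention and scoring steps described in this section can be sketched in plain Python. This is only an illustration under stated assumptions: the embeddings $h_i$ and $d_j$ are taken as given (here tiny hand-written vectors rather than CNN outputs), all vectors are assumed to be non-zero, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; assumes neither vector is the zero vector."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def option_probabilities(h, d):
    """h: one embedding per question-option tuple (h_i).
    d: one embedding per sentence of the most relevant paragraph (d_j).
    Returns (scores, probabilities) over the options."""
    scores = []
    for h_i in h:
        a_i = [cosine(h_i, d_j) for d_j in d]      # a_ij: cosine similarities
        r_i = softmax(a_i)                          # r_ij: attention weights
        # m_i: attention-weighted sum of the sentence embeddings
        m_i = [sum(r * d_j[t] for r, d_j in zip(r_i, d))
               for t in range(len(h_i))]
        scores.append(cosine(h_i, m_i))             # score_i
    return scores, softmax(scores)                  # p_i


# Toy example: two options, two sentences, 2-dimensional embeddings.
h = [[1.0, 0.0], [0.0, 1.0]]
d = [[1.0, 0.1], [0.9, 0.2]]
scores, probs = option_probabilities(h, d)
```

Because both sentence embeddings point roughly along the first axis, the first option receives the higher score and therefore the higher probability.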

Dealing with forbidden options
Figure 1: Architecture of our proposed model. The attention layer attends on the sentence embeddings $d_j$ using the question-option tuple embeddings $h_i$. The score calculation layer calculates the cosine similarity between $m_i$ and $h_i$, which is passed through softmax to get the final probability distribution.

We refer to options like none of the above, two of the above, all of the above, both (a) and (b) as forbidden options. During training, the questions having a forbidden option as the correct option were not considered. Furthermore, if a question had a forbidden option, that particular question-option tuple was not taken into consideration. Let $S = [score_i \;\forall i \mid i\text{-th option not in forbidden options}]$ and $|S| = k$. During prediction, the questions having one of the forbidden options as an option are dealt with as follows:
1. Questions with none of the above / all of the above option: If $\max(S) - \min(S) < threshold$, then the final option is the concerned forbidden option. Else, the final option is $\arg\max_i p_i$.
2. Questions with two of the above option: If $S_{(k)} - S_{(k-1)} < threshold$, where $S_{(n)}$ denotes the $n$-th order statistic, then the final option is the concerned forbidden option. Else, the final option is $\arg\max_i p_i$.
3. Questions with both (a) and (b) type option: For this type of question, let the corresponding scores for the two options be $score_{i_1}$ and $score_{i_2}$. If $|score_{i_1} - score_{i_2}| < threshold$, then the final option is the concerned forbidden option. Else, the final option is $\arg\max_i p_i$.
4. Questions with any of the above option: Very few questions had this option. In this case, we always choose the concerned forbidden option.

We tried different threshold values ranging from 0.0 to 1.0. Finally, the threshold was set to the value that gave the highest accuracy on the training set for these kinds of questions.
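The decision rules above, together with the threshold search, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the `kind` labels and the toy scores are hypothetical, and the sketch assumes that picking the best regular option by score is equivalent to picking it by $p_i$, since softmax preserves the ordering of the scores.

```python
def predict(scores, forbidden, threshold, pair=None):
    """scores: dict mapping each non-forbidden option label to its score (the set S).
    forbidden: None, or a (label, kind) tuple for the forbidden option in the
               question, with kind in {"none_or_all", "two", "both", "any"}.
    pair: the two option labels named by a "both (a) and (b)" style option."""
    best = max(scores, key=scores.get)  # same winner as argmax over p_i
    if forbidden is None:
        return best
    label, kind = forbidden
    ordered = sorted(scores.values())
    if kind == "any":
        return label                    # always choose the forbidden option
    if kind == "none_or_all":
        close = ordered[-1] - ordered[0] < threshold     # max(S) - min(S)
    elif kind == "two":
        close = ordered[-1] - ordered[-2] < threshold    # S_(k) - S_(k-1)
    elif kind == "both":
        close = abs(scores[pair[0]] - scores[pair[1]]) < threshold
    else:
        raise ValueError(f"unknown forbidden option kind: {kind}")
    return label if close else best


def pick_threshold(examples, candidates):
    """examples: list of (scores, forbidden, pair, gold_label) tuples.
    Returns the first candidate threshold with the highest accuracy."""
    def accuracy(t):
        hits = sum(predict(s, f, t, p) == gold for s, f, p, gold in examples)
        return hits / len(examples)
    return max(candidates, key=accuracy)


# Toy data: option "d" is the forbidden option in both questions.
examples = [
    ({"a": 0.90, "b": 0.88, "c": 0.20}, ("d", "two"), None, "d"),
    ({"a": 0.90, "b": 0.30, "c": 0.20}, ("d", "none_or_all"), None, "a"),
]
candidates = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
threshold = pick_threshold(examples, candidates)
```

On this toy data a threshold of 0.0 never selects a forbidden option, while very large thresholds always select it, so the search settles on a small positive value.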

Training Details
We tried two different CNN models, one having $f_j$'s equal to 3, 4, 5 and the other having $f_j$'s equal to 2, 3, 4. We refer to the two models as $\mathrm{CNN}_{3,4,5}$ and $\mathrm{CNN}_{2,3,4}$ respectively. The values of the hyperparameters used are $d = 300$ and $k = 100$. The other hyperparameters vary from dataset to dataset. Since the number of options varies from question to question, our model generates the probability distribution over the set of available options. Similarly, the number of sentences in the most relevant paragraph can vary from question to question, so we set $a_{ij} = -\infty$ whenever $d_j$ was a zero vector, which gives such sentences zero attention weight. The cross-entropy loss was minimized during training.
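How the two filter configurations produce an embedding of the same size can be seen in a small sketch of convolution followed by average pooling. It is an illustration under several assumptions: the weights are random, bias terms and any non-linearity are omitted because the text does not specify them, the dimensions are small stand-ins for $d = 300$ and $k = 100$, and the helper names are hypothetical.

```python
import random


def conv_avg_pool(x, filters):
    """x: list of l word vectors, each of length d.
    filters: k filters of one width f, each a list of f rows of length d.
    Returns a k-dimensional vector: every filter's response averaged over
    the l - f + 1 valid positions (requires l >= f)."""
    f = len(filters[0])
    n_pos = len(x) - f + 1
    pooled = []
    for filt in filters:
        total = 0.0
        for t in range(n_pos):
            window = x[t:t + f]
            total += sum(w * v
                         for row, vec in zip(filt, window)
                         for w, v in zip(row, vec))
        pooled.append(total / n_pos)
    return pooled


def embed(x, filter_bank):
    """Concatenate the pooled outputs of all filter widths (3k values for three widths)."""
    vec = []
    for filters in filter_bank:
        vec.extend(conv_avg_pool(x, filters))
    return vec


def make_bank(widths, d, k, rng):
    """Random filter bank: for each width f, k filters of shape f x d."""
    return [[[[rng.uniform(-0.1, 0.1) for _ in range(d)]
              for _ in range(f)]
             for _ in range(k)]
            for f in widths]


rng = random.Random(0)
d, k = 8, 4                                   # stand-ins for d = 300, k = 100
x = [[rng.gauss(0.0, 1.0) for _ in range(d)]  # six word vectors, e.g. [q; o_i]
     for _ in range(6)]
emb_234 = embed(x, make_bank((2, 3, 4), d, k, rng))
emb_345 = embed(x, make_bank((3, 4, 5), d, k, rng))
```

Both configurations yield a $3k$-dimensional vector regardless of the input length; only the span of words seen by each filter differs.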

Results and Discussion
The accuracy of our proposed model on the validation sets of the TQA and SciQ datasets (Kembhavi et al., 2017; Welbl et al., 2017) is given in Table 1 and Table 2. $\mathrm{GRU}_{bl}$ refers to the model where the CNN is replaced by a Gated Recurrent Unit (GRU) (Cho et al., 2014) to embed the question-option tuples and the sentences. The size of the GRU cell was 100.
For the SciQ dataset, we used the associated passage provided with the question. AS Reader (Kadlec et al., 2016), which models the question and the paragraph using a GRU followed by an attention mechanism, got 74.1% accuracy on the SciQ test set. However, for a question, they used a different corpus to extract the text passage. Hence it is not judicious to compare the two models.

Table 1: Accuracy for true-false and multiple choice questions on the validation set of the TQA dataset.

As can be seen from Tables 1 and 2, $\mathrm{CNN}_{2,3,4}$ gives the best performance on the validation sets of both datasets, so we evaluate it on the test sets. Note that $\mathrm{GRU}_{bl}$ highly overfits on the SciQ dataset, which suggests that CNN-based models work better for those datasets where long-term dependency is not a major concern. This rationale is also supported by the fact that $\mathrm{CNN}_{2,3,4}$ performed better than $\mathrm{CNN}_{3,4,5}$ on the two datasets.

Table 2: Accuracy of the models on the SciQ dataset. The first three accuracies are on the validation set. The last accuracy is of the $\mathrm{CNN}_{2,3,4}$ model on the test set.
Baselines for TQA dataset: Three baseline models are mentioned in Kembhavi et al. (2017). These baseline models rely on word-level attention and on encoding the question and the options separately. The baseline models are a random model, the Text-Only model and the BIDAF model (Seo et al., 2016). The Text-Only model is a variant of the Memory network (Weston et al., 2014) where the paragraph, question and options are embedded separately using an LSTM followed by an attention mechanism. In the BIDAF model, character and word level embeddings are used to encode the question and the text, followed by a bidirectional attention mechanism. This model predicts the subtext within the text containing the answer. Hence, the predicted subtext is compared with each of the options to select the final option.
Note that the results of the baseline models given in Kembhavi et al. (2017) were on the test set, but the authors had used a different data split than the publicly released split. As per the suggestion of the authors, we evaluate the $\mathrm{CNN}_{2,3,4}$ model by combining the validation and test sets. The comparison with the baseline models is given in Table 3. As can be seen from Table 3, the $\mathrm{CNN}_{2,3,4}$ model shows significant improvement over the baseline models. We argue that our proposed model outperforms the Text-Only model for three reasons: (i) sentence level attention, (ii) question-option tuple as input, and (iii) the ability to tackle forbidden options. Sentence level attention leads to better attention weights, especially in cases where a single sentence suffices to answer the question. If only the question is given as input to the model, then the model has to extract the embedding of the answer, whereas giving the question-option tuple as input simplifies the task to a comparison between the two embeddings.
The SciQ dataset didn't have any questions with forbidden options. However, in the validation set of TQA, 433 out of 1530 multiple choice questions (about 28%) had forbidden options. Using the proposed threshold strategy for tackling forbidden options, $\mathrm{CNN}_{2,3,4}$ gets 188 out of 433 questions correct (about 43.4%). Without using this strategy and giving every question-option tuple as input, $\mathrm{CNN}_{2,3,4}$ gets 109 out of 433 questions correct (about 25.2%).

Conclusions and Future Work
In this paper, we proposed a CNN-based model for multiple choice question answering and showed its effectiveness in comparison with several LSTM-based baselines. We also proposed a strategy for dealing with forbidden options. Using the question-option tuple as input gave a significant advantage to our model. However, there is a lot of scope for future work. Our proposed model doesn't work well in cases where complex deductive reasoning is needed to answer the question. For example, suppose the question is "How much percent of parent isotope remains after two half-lives?" and the lesson is on carbon dating and contains the definition of half-life. Answering this question using the definition requires understanding the definition and transforming the question into a numerical problem. Our proposed model lacks such skills and will have near random performance for such questions.
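As a worked illustration of the numerical problem hidden in this example, assume the standard definition of half-life as the time in which half of the remaining parent isotope decays. The symbols $N_0$, $N(n)$ and $n$ are introduced here for illustration only. The fraction remaining after $n$ half-lives is then

$$\frac{N(n)}{N_0} = \left(\frac{1}{2}\right)^{n}, \qquad \frac{N(2)}{N_0} = \left(\frac{1}{2}\right)^{2} = \frac{1}{4} = 25\%.$$

Reaching this answer requires mapping the textual definition to the formula and then evaluating it, a step that a model scoring options by embedding similarity is not designed to perform.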

Table 3: Accuracy of different models for true-false and multiple choice questions. Results marked with (*) are taken from Kembhavi et al. (2017) and are on the test set obtained using a different data split. The result of our proposed model is on the publicly released validation and test sets combined.