Which is the Effective Way for Gaokao: Information Retrieval or Neural Networks?

As one of the most important tests in China, Gaokao is designed to be difficult enough to distinguish excellent high school students. In this work, we detail the Gaokao History Multiple Choice Questions (GKHMC) and propose two different approaches to address them using various resources. One approach is based on an entity search technique (IR approach); the other is based on a text entailment approach in which we specifically employ deep neural networks (NN approach). Experiments on our collected real Gaokao questions show that the two approaches are good at different categories of questions: the IR approach performs much better on entity questions (EQs), while the NN approach shows its advantage on sentence questions (SQs). We achieve state-of-the-art performance and show that a hybrid method is indispensable when participating in real-world tests.


Introduction
Gaokao, namely the National College Entrance Examination, is the most important examination for Chinese senior high school students. Every college in China, whether it is a Top 10 or a Top 100 college, accepts only exam-takers whose Gaokao score is higher than its threshold score. As almost 10 million students take the examination every year, Gaokao needs to be difficult enough to distinguish the excellent students. Therefore, it includes various types of questions, such as multiple-choice questions, short-answer questions and essays, and it covers several different subjects, such as Chinese, Math and History. In this work, we focus on Gaokao History Multiple Choice questions, denoted as GKHMC. Both the factoid question answering task and the reading comprehension task are similar to GKHMC, but GKHMC questions have their own characteristics. A multiple-choice question in GKHMC, such as the examples shown in Figure 1, is composed of a question stem and four candidates. Our goal is to figure out the only correct candidate, but there are certain obstacles to achieving it. First, several background sentences and a lead-in sentence jointly constitute the question stem, which makes these questions more complicated than the one-sentence-long factoid questions that can be handled by existing approaches such as (Kolomiyet and Moens, 2011; Kwiatkowski et al., 2013; Berant and Liang, 2014; Yih et al., 2015). Secondly, the background sentences generally contain various clues to the historical events or personages that may be the hidden key to answering the question. These clues may include Tang poems and Song iambic verses, domain-specific expressions, and even mixtures of modern Chinese and excerpts from ancient books. This dependence on background knowledge makes models designed for reading comprehension, such as (Peñas et al., 2013; Richardson et al., 2013), fail. Thirdly, the diversity of the candidates' granularity, i.e. candidates can be either entities or sentences, makes it harder to match a candidate with the stem. Answer selection therefore differs from former approaches, whose candidates are usually just entities. Lastly, as the candidates are already given, the answer generation step of former neural-network-based question answering systems is no longer necessary.
† Both authors contributed equally to this paper.
As mentioned above and shown in Figure 1, in accordance with the candidates' granularity, GKHMC questions can be divided into two types: entity questions (EQs) and sentence questions (SQs). Entity questions are those whose candidates are all entities, whether they are people, dynasties, wars or something else. Sentence questions are those whose candidates are all sentences. We observe that these two types of questions have their own specific characteristics. Most background sentences in EQs are descriptions of the right candidate, so an information-retrieval-like approach may be particularly suitable for them. Meanwhile, as the background sentences and lead-in sentences in SQs are more like entailing text, these questions are not well suited to lexical searching and matching. Therefore, it seems more reasonable to resolve SQs with textual reasoning techniques.
In this paper, we ask which kind of approach is more effective for GKHMC and, furthermore, whether specific methods should be selected for different types of questions. In view of the various characteristics of GKHMC questions, we introduce two independent approaches to address them. One is based on an entity search technique (IR approach) and the other is based on a text entailment approach in which we specifically employ deep neural networks (NN approach). In the IR approach, we use the key entities and relationships extracted from a question to form a query, then run this query against all the text resources to get the most relevant candidate. In the NN approach, we combine the question text with each candidate to form four statements, then judge how likely each statement is to be right so that we can figure out which candidate is most likely to be the correct answer.
To test the performance of the two approaches, we collected and classified the multiple-choice questions in Gaokao test papers from 2011 to 2015 from all over the country, and we have released them. From the results, we find that the performance of the two approaches differs significantly on each kind of question. That is, the IR approach shows noticeable advantages on EQs, while the NN approach performs much better on SQs. This will be further discussed in Section 4.4.
In this paper, our contributions are as follows:
• We give a detailed description of the Gaokao History Multiple Choice Questions task and show its importance and difficulty.
• We release a dataset for this task. The dataset is manually collected and classified. All questions in the dataset are real Gaokao questions from 2011 to 2015.
• We introduce two different approaches for this task. Each approach achieves promising results. We also compare the two approaches and find that they are complementary, i.e. they are good at different types of questions.
• We introduce the permanent-provisional memory network (PPMN) to jointly model background knowledge and the sentences in the question stem, and it beats existing memory networks on SQs.

Dataset
As described in the Introduction, we collected the history multiple-choice questions from Gaokao papers all over the country over the most recent five years. However, quite a lot of them contain graphs or tables, which require techniques beyond natural language processing (NLP). We therefore filtered out these questions and manually classified the rest into two parts: EQs and SQs. The numbers of the different kinds of questions are listed in Table 1. Examples of the different types of questions, translated into English, are shown in Figure 1. It is worth mentioning that there is a special type of question on test papers named sequential questions. The candidates of such questions are just ordered numbers, and every number stands for a certain content given in the question stem. We simply replace every sequential number in the candidates with its corresponding content. Then, we can classify these questions as EQs or SQs according to the type of the contents.
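The replacement of sequential numbers described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `expand_sequential_candidate`, the use of circled digits as the sequential markers, and the placeholder contents are all hypothetical and are not taken from the released dataset.

```python
import re

def expand_sequential_candidate(candidate, contents, sep="; "):
    """Replace each sequential marker in a candidate with the content it stands for.

    Assumption: markers are circled digits such as "①"; `contents` maps each
    marker to the text given for it in the question stem.
    """
    markers = re.findall("[①②③④⑤⑥]", candidate)
    return sep.join(contents[m] for m in markers)

# Placeholder contents, not real Gaokao data.
contents = {"①": "event A", "②": "event B", "③": "event C", "④": "event D"}
expanded = expand_sequential_candidate("①③", contents)
```

After this expansion, the candidate consists of ordinary text, so it can be classified as an EQ or SQ candidate like any other.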
We also collected a wide variety of resources, including Baidu Encyclopedia, textbooks and practice questions, as the external knowledge against which the generated queries are run. Baidu Encyclopedia, also known as Baidu Baike, is similar to Wikipedia, but its content is written in Chinese. We denote this resource as BAIKE. The textbook resource contains three compulsory history textbooks published by People's Education Press. We denote them as BOOK.
We also gathered about 50,000 practice questions and their answers, which we denote as TIKU.

IR Approach
GKHMC questions require figuring out which of the four given candidates is most relevant to the question stem. Our IR approach is inspired by this observation. The diagram of the IR approach is illustrated in Figure 2.
The pipeline of the IR approach is: (1) use the classifier to automatically classify the question and select the weights according to the classification result; (2) calculate the relevance scores of every candidate (we introduce three different methods with seven score functions to calculate the relevance scores) and combine them with the selected weights; (3) choose the candidate with the highest score as the right answer. Despite its simplicity, the IR approach achieves promising results in our experiments.

Naive Bayes Classifier
We build a naive Bayes classifier to classify questions. Using the length of candidates, the number of entities in candidates and the number of verbs in candidates as features, every question is classified as an EQ or an SQ. When building the classifier, we perform 10-fold cross validation on the GKHMC dataset; the results are 90.00% precision and 84.38% recall on EQs and 95.79% precision and 97.43% recall on SQs.
Figure 2: Pipeline of IR approach.
2 The code of this project can be obtained at https://github.com/IACASNLPIR/GKHMC
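To make the classification step concrete, the following is a minimal sketch of a naive Bayes classifier over the three features named above. It is not the released implementation: the Gaussian likelihood, the function names `fit_gaussian_nb` and `predict`, and the toy feature values are assumptions made for illustration only.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict(model, features):
    """Return the class with the highest log posterior."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for x, m, var in zip(features, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)

# Toy features: [candidate length, entity count, verb count]; values are made up.
X = [[3, 1, 0], [4, 1, 0], [2, 1, 0], [15, 2, 2], [18, 1, 3], [12, 2, 1]]
y = ["EQ", "EQ", "EQ", "SQ", "SQ", "SQ"]
model = fit_gaussian_nb(X, y)
```

In this toy setting, short candidates without verbs fall on the EQ side and long candidates with verbs on the SQ side, which mirrors the intuition behind the chosen features.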

Score Functions
To calculate the relevance between the question stem and the candidates, we introduce three different methods with seven score functions, which are summarized in Table 2.
Lexical Matching Score: Since the correct candidate is usually directly related to the question stem, it is reasonable to assume that the facts in the question stem may appear, together with the correct candidate, in documents related to them. Here we introduce our lexical matching score functions, taking BAIKE as our external resource. Four queries are formed, each combining the question stem with one candidate. We then run every query and sum the scores of the top three returned documents as the lexical matching score. We use $score_{top_i}$ to denote the score of the $i$-th returned document; $score_{top_i}$ is calculated by Lucene's TFIDFSimilarity function. The lexical matching score $Score_{lexical}(candidate_k)$ is calculated as
$$Score_{lexical}(candidate_k) = \sum_{i=1}^{3} score_{top_i} \qquad (1)$$
We build indices for BAIKE at different granularities. The index built over every BAIKE document is denoted as BAIKE Document Index (BDI). The index built over every paragraph in BAIKE is denoted as BAIKE Paragraph Index (BPI). The index built over every sentence in BAIKE is denoted as BAIKE Sentence Index (BSI).
We denote the lexical matching score functions using BDI, BPI and BSI as $Score_{BDI}$, $Score_{BPI}$ and $Score_{BSI}$ respectively.
Entity Co-Occurrence Score: We also consider the relevance of entities in terms of co-occurrence. If two entities often appear together, we assume that they are relevant. We use normalized Google distance (Cilibrasi and Vitanyi, 2007) to calculate the entity co-occurrence score $Score_{co}(candidate_k)$. In its definition, $e_i$ is an entity; $f(e_i)$ is the number of parts that contain entity $e_i$; $f(e_i, e_j)$ is the number of parts that contain both entities $e_i$ and $e_j$; $E_{stem}$ and $E_{candidate_k}$ denote the entities in the question stem and in the candidate.
The entity co-occurrence can be counted at the document, paragraph or sentence level, and the resulting functions are denoted as $Score_{BDC}$, $Score_{BPC}$ and $Score_{BSC}$ respectively.
Page Link Score: Inspired by the PageRank algorithm (Page et al., 1999), we assume that entities that have links to each other are relevant. Here we introduce the page link score function. We use $Link(e_i, e_j)$ to denote the number of links between entities $e_i$ and $e_j$. The link score $Score_{link}(candidate_k)$ is calculated from $Link(e_i, e_j)$, where $e_i \in E_{stem}$ and $e_j \in E_{candidate_k}$.
We only count the number of links between BAIKE documents, and the resulting function is denoted as $Score_{BDL}$.
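As a rough illustration of the co-occurrence counts above, the sketch below computes the normalized Google distance in its standard form (Cilibrasi and Vitanyi, 2007) for every stem-candidate entity pair. The total number of parts `n_parts`, the function names, and the toy corpus are assumptions; how the pairwise distances are aggregated into a single candidate score is not reproduced here.

```python
import math

def ngd(f_i, f_j, f_ij, n_parts):
    """Normalized Google distance from occurrence counts; lower means more related."""
    if f_ij == 0:
        return float("inf")  # the two entities never co-occur
    log_i, log_j = math.log(f_i), math.log(f_j)
    denominator = math.log(n_parts) - min(log_i, log_j)
    if denominator == 0:
        return 0.0  # both entities occur in every part
    return (max(log_i, log_j) - math.log(f_ij)) / denominator

def pairwise_ngd(parts, stem_entities, candidate_entities):
    """Compute NGD for every (stem entity, candidate entity) pair.

    `parts` is a list of entity sets, one per document, paragraph or sentence.
    """
    def count(*entities):
        return sum(1 for part in parts if all(e in part for e in entities))

    return {
        (e_i, e_j): ngd(count(e_i), count(e_j), count(e_i, e_j), len(parts))
        for e_i in stem_entities
        for e_j in candidate_entities
    }

# Toy corpus of four parts with placeholder entity names.
parts = [{"A", "B"}, {"A", "B"}, {"A"}, {"C"}]
distances = pairwise_ngd(parts, ["A"], ["B", "C"])
```

Switching `parts` between documents, paragraphs and sentences corresponds to the three co-occurrence granularities described above.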

Function      Description
Score_BDI     Score_lexical using BDI
Score_BPI     Score_lexical using BPI
Score_BSI     Score_lexical using BSI
Score_BDC     document-level Score_co
Score_BPC     paragraph-level Score_co
Score_BSC     sentence-level Score_co
Score_BDL     document link score function
Table 2: Summary of score functions.

Training Weights
Since we have seven score functions, we need to combine them with different weights.
For a given question, we calculate the score of every candidate as follows:
$$Score(candidate_k) = \sum_{i=1}^{7} w_i f_i(candidate_k)$$
where $k \in \{1, 2, 3, 4\}$, $f_i$ is one of the seven score functions and $w_i$ is the corresponding weight. We then normalize the scores of all candidates. Supposing that the true answer of a question is the $n$-th candidate, where $n \in \{1, 2, 3, 4\}$, we define the loss of that question and, from it, the total loss of a dataset with $M$ questions. All operations are differentiable, so we can use the gradient descent algorithm to train the weights.
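The training procedure can be sketched as follows. Because the exact normalization and loss are not spelled out here, this sketch assumes a softmax normalization and a negative log-likelihood loss averaged over the questions; these choices, the function names, and the toy feature values are illustrative assumptions rather than the authors' exact setup.

```python
import math

def candidate_probs(weights, feats):
    """Weighted sum of score functions per candidate, normalized with a softmax.

    feats[k][i] is the value of score function i for candidate k.
    """
    scores = [sum(w * f for w, f in zip(weights, row)) for row in feats]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_weights(questions, n_features, lr=0.1, epochs=200):
    """Batch gradient descent; each question is (feats, n) with n the 0-based true answer."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for feats, n in questions:
            probs = candidate_probs(weights, feats)
            for k, row in enumerate(feats):
                # Gradient of -log(p_n) with respect to the score of candidate k.
                delta = probs[k] - (1.0 if k == n else 0.0)
                for i, f in enumerate(row):
                    grad[i] += delta * f
        weights = [w - lr * g / len(questions) for w, g in zip(weights, grad)]
    return weights

# Two toy questions with two score functions each; the numbers are made up.
questions = [
    ([[1.0, 0.2], [0.1, 0.3], [0.0, 0.1], [0.2, 0.2]], 0),
    ([[0.1, 0.5], [0.9, 0.4], [0.2, 0.5], [0.0, 0.3]], 1),
]
weights = train_weights(questions, n_features=2)
```

With seven score functions, `n_features` would be 7, and separate weight vectors could be trained for EQs and SQs by splitting the training questions by type.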

NN Approach
As deep neural networks are widely used in natural language processing tasks and have achieved great success, it is natural to build deep neural networks to handle the GKHMC task. We therefore built several deep neural networks with different structures, and we used both TIKU and BOOK to train these models, in order to teach the models not only how to answer the questions but also the historical knowledge.
[Figure 3. Example question text shown in the figure: "After WWⅡ, the United States and the Union of Soviet Socialist Republics gradually began competing in all fields including politics, economy and military. In order to coordinate and promote economical development of member countries in socialist party, USSR established Council for Mutual Economic Assistance in 1949. This move is mainly for confronting Marshall Plan."]
To handle the joint inference between background knowledge and question stems in GKHMC questions, we introduce the permanent-provisional memory network (PPMN). As illustrated in Figure 3, our PPMN is composed of the following components:
1. Permanent Memory Module that plays the same role as a knowledge base and stores the original text from history textbooks or other relevant resources.
2. Provisional Memory Module that generates some contents based on the current word in background sentences, permanent knowledge and the lead-in sentence.
3. Input Module that reads the words sequentially in background sentences and maps them into high-dimensional vector space.
4. Similarity Judger that scores the similarity between the output of provisional memory and the vector representations of answer candidates.
5. Sentence Encoder that encodes lead-in sentence, sentences in permanent memory and answer candidates.
In the above equations, $w_t$ denotes the $t$-th word in the background sentences, GRU is defined in equations (19) to (22), $h_{t-1}$ and $h_t$ are the hidden representations of $w_{t-1}$ and $w_t$ respectively, $l$ stands for the lead-in sentence encoded by the sentence encoder, • is element-wise multiplication and $m_t$ is the result computed at the current step. The final output of this module is the last provisional memory vector $m_n$, where $n$ is the length of the background sentences.
Input Module: This module shares the weight matrices of the sentence encoder and calculates the hidden state of every word sequentially. All the words in the background sentences are first mapped into hidden states in this module and can then be taken as input by the other modules. The hidden states are calculated in the same way as in equations (19) to (22).
Similarity Judger: This module takes the concatenation of the output of the provisional memory and the representation of an answer candidate as input and uses a classifier based on logistic regression to score it. The judging procedure is defined as follows:
$$\hat{p} = \mathrm{softmax}(W_l [m_K; a])$$
where $W_l$ is a matrix that maps the concatenated vector $[m_K; a]$ into a vector $\hat{p}$ of length 2 and $a$ stands for the answer candidate encoded by the sentence encoder.
Sentence Encoder: We experimented with several recurrent neural networks of different structures as the sentence encoder. Both Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) perform much better than the standard tanh RNN. However, considering that the computation of LSTM is more complicated and time-consuming, we choose GRU as the sentence encoder. The calculation of GRU, denoted as $h_t = GRU(w_t, h_{t-1})$, is given in equations (19) to (22). In these equations, $w_t$ is extracted from a word embedding matrix $W_e$, initialized by word2vec (Mikolov et al., 2013), through an id number that indicates which word it is.
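For readers unfamiliar with the GRU update $h_t = GRU(w_t, h_{t-1})$, the following sketch implements one step in the commonly used formulation of Cho et al. (2014): an update gate, a reset gate, a candidate state and an interpolation between the previous and candidate states. The parameter names (`Wz`, `Uz`, `bz`, and so on) and the use of bias terms are assumptions and may differ from the paper's equations (19) to (22).

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(w_t, h_prev, p):
    """One GRU update; `p` is a dict holding the weight matrices and biases."""
    # Update gate z and reset gate r.
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], w_t), matvec(p["Uz"], h_prev), p["bz"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wr"], w_t), matvec(p["Ur"], h_prev), p["br"])]
    # Candidate state computed from the input and the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a + b + c) for a, b, c in
               zip(matvec(p["Wh"], w_t), matvec(p["Uh"], gated), p["bh"])]
    # Interpolate between the previous state and the candidate state.
    return [zi * hi + (1.0 - zi) * ti for zi, hi, ti in zip(z, h_prev, h_tilde)]

# All-zero parameters (input size 3, hidden size 2) give z = 0.5 and a zero candidate state.
zeros_in = [[0.0] * 3 for _ in range(2)]
zeros_hid = [[0.0] * 2 for _ in range(2)]
params = {"Wz": zeros_in, "Uz": zeros_hid, "bz": [0.0, 0.0],
          "Wr": zeros_in, "Ur": zeros_hid, "br": [0.0, 0.0],
          "Wh": zeros_in, "Uh": zeros_hid, "bh": [0.0, 0.0]}
h_t = gru_step([1.0, 2.0, 3.0], [0.4, -0.2], params)
```

Encoding a sentence then amounts to applying `gru_step` to each word vector in order and keeping the final hidden state.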
Loss Function: Intuitively, as we want the predicted score to be as close to the true score (0 or 1) as possible, we introduce a negative log-likelihood loss function:
$$L = -\sum_{i=1}^{2} y_i \log \hat{p}_i$$
where $y$ is [0 1] if $a$ is the right answer and [1 0] otherwise.
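The similarity judger and its loss can be illustrated together. The sketch below assumes that the logistic-regression classifier is a softmax over a linear map of the concatenated vector, with no bias term; the function names and the toy vectors are hypothetical.

```python
import math

def judge(W_l, m, a):
    """Score a candidate: softmax over W_l applied to the concatenation [m; a]."""
    x = m + a  # list concatenation stands in for [m; a]
    logits = [sum(w * v for w, v in zip(row, x)) for row in W_l]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(p_hat, is_right_answer):
    """Negative log-likelihood with y = [0, 1] for the right answer and [1, 0] otherwise."""
    return -math.log(p_hat[1] if is_right_answer else p_hat[0])

# With zero weights both classes are equally likely, so the loss is log(2).
m = [0.3, -0.1]   # toy provisional memory output
a = [0.5, 0.2]    # toy encoded candidate
W_l = [[0.0] * 4 for _ in range(2)]
p_hat = judge(W_l, m, a)
loss = nll_loss(p_hat, is_right_answer=True)
```

At test time, the candidate whose statement receives the highest probability of being right would be selected among the four.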
Optimization Algorithm: We use AdaDelta (Zeiler, 2012) to minimize the loss $L$, and use backpropagation through time to propagate the gradients through the intermediate computations.

Experiments of IR Approach
To find the best weights for EQs and SQs, we use TIKU as the training dataset. Using gradient descent to optimize the parameters, we obtain the best weights for EQs and SQs separately: $W_{EQ}$ is the best weight vector for EQs and $W_{SQ}$ is the best for SQs. We test these weights on the EQs and SQs of GKHMC respectively, and the results are shown in Table 3. With these weights, we achieve promising results. We then use GKHMC as the dataset to test the performance of the IR approach with the naive Bayes classifier. The precision on EQs and SQs is 48.75% and 28.42% respectively. It is clear that the accuracy on both EQs and SQs decreases with automatic classification, but the IR approach still achieves much better results on EQs than on SQs.

Results of NN Approach
We take several other neural network models with memory capability as our baselines, including the standard tanh recurrent neural network (RNN), long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997), gated recurrent unit (GRU) (Cho et al., 2014), end-to-end memory network (MemNN) (Sukhbaatar et al., 2015) and dynamic memory network (DMN) (Kumar et al., 2016). As for our PPMN, we summarize the syllabus of all history textbooks for senior school students to cover as many knowledge points as possible, obtaining 198 sentences that are stored in the permanent memory module. For all the above models, we used rmsprop (Hinton et al., 2012) with a learning rate of 0.001 for training; the number of hidden units and the size of the memory were both set to 400, and the batch size was set to 1000. We also used dropout (Srivastava et al., 2014) with a probability of 0.5 to prevent the models from overfitting. We test all these models and the results are shown in Table 4.
Table 4: Results of all neural network models.
From the results, we observe that our PPMN achieves the best performance on all kinds of GKHMC questions and that all memory-capable neural network models beat RNN. Interestingly, MemNN performs much worse than the other memory-capable models on SQs, whereas it shows promising capability on EQs.

Combine IR Approach and NN Approach
It can be easily observed from the above experiments that the IR approach and the NN approach are complementary, namely each performs better than the other on a different category of questions. We therefore combine the two approaches via a weight matrix $W^c \in \mathbb{R}^{2 \times 2}$ as follows:
$$score = W^c_{i\cdot} \, [score_{IR}, score_{NN}]^{\top}$$
where $W^c_{i\cdot}$ means the $i$-th row of $W^c$, selected by the category of the question, and $score_{IR}$, $score_{NN}$ are the scores calculated by the IR and NN approaches respectively. Here, the categories of questions are given by the naive Bayes classifier. The performance of the combined model and its comparison with the two individual approaches are illustrated in Figure 4.
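The combination step can be sketched as a row lookup followed by a weighted sum. The weight values, the category indices (0 for EQ, 1 for SQ) and the per-candidate scores below are made-up illustrations, not trained or reported values.

```python
def combine(W_c, category, score_ir, score_nn):
    """Weighted sum of the IR and NN scores using the row of W_c for this category."""
    row = W_c[category]
    return row[0] * score_ir + row[1] * score_nn

def answer(W_c, category, ir_scores, nn_scores):
    """Return the 0-based index of the candidate with the highest combined score."""
    combined = [combine(W_c, category, s_ir, s_nn)
                for s_ir, s_nn in zip(ir_scores, nn_scores)]
    return max(range(len(combined)), key=combined.__getitem__)

# Hypothetical weights: row 0 (EQ) trusts IR more, row 1 (SQ) trusts NN more.
W_c = [[0.8, 0.2],
       [0.3, 0.7]]
ir_scores = [0.6, 0.2, 0.1, 0.1]
nn_scores = [0.1, 0.5, 0.2, 0.2]
eq_choice = answer(W_c, 0, ir_scores, nn_scores)
sq_choice = answer(W_c, 1, ir_scores, nn_scores)
```

In this toy case the same pair of score lists yields different answers depending on the predicted category, which is the behavior the per-category rows are meant to provide.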

Discussion
Overall, it can be easily observed that the IR approach is more proficient on EQs (49.38% vs 40.63%), whereas the NN approach is superior to it on SQs (28.60% vs 40.24%). The hybrid method composed of the two approaches achieves the best performance (42.60%).
As for the IR approach itself, its performance on EQs is much better than on SQs. This may be because the IR approach is based on the relevance between the candidates and the question stem. In EQs, the information given by the question stem is usually a description of the key entity, which appears only in the right candidate. It is therefore easy for the correct candidate to achieve a higher relevance score than the others, and that is why the IR approach achieves promising results on EQs. In SQs, by contrast, the key entity does not appear in any candidate and needs to be inferred from the question stem. Whether in terms of lexical matching, entity co-occurrence or page links, the relevance between the question stem and the correct candidate may be as low as that of the other candidates. Therefore, it is not surprising that the IR approach is not sufficient to figure out the right choice on SQs. After adding the classifier to the IR approach, we notice a decrease in accuracy on both EQs and SQs. This is caused by misclassified questions, which demonstrates that the weights $W_{EQ}$ and $W_{SQ}$ are particularly effective on EQs and SQs respectively.
The experiments with the NN approach show that our PPMN does have advantages on GKHMC questions. During training, the performance of the RNN model is unstable, i.e. its precision still fluctuates after the loss has converged. In contrast, the performance of the other models is more stable. Hence, we consider that the memory mechanism helps a model to "remember" the knowledge that appeared in the training data. Compared with the "inside" memory of LSTM and GRU, the specially designed memory components in MemNN, DMN and PPMN are more powerful at finding the relationships between the question stem and the answer candidates in GKHMC questions. However, the limited performance of MemNN on SQs indicates that word order in GKHMC questions is especially important for questions containing no distinct entities. Last but not least, the best performance of PPMN may be largely due to the novel permanent memory module, which helps find the implicit relationships with the stored background knowledge.
The state-of-the-art performance of the hybrid method indicates that combining the IR approach and the NN approach is the best strategy for addressing GKHMC questions. As illustrated in Figure 4, the combined method shows an enormous advantage on EQs. This may be because both character and word embeddings are more sufficient to cover the lexical meaning, and some EQs may be more suitable to be handled as SQs. Compared to the NN approach separately, the hybrid way does

Related Work
Answering real-world questions in various subjects has gained attention since the beginning of this century. The ambitious Project Halo (Friedland et al., 2004) was proposed to create a "digital" Aristotle that can encompass most of the world's scientific knowledge and be capable of addressing complex problems with novel answers. In this project, (Angele et al., 2003) employed handcrafted rules to answer chemistry questions, and (Gunning et al., 2010) took physics and biology into account. Another important line of work is solving mathematical questions. (Mukherjee and Garain, 2008) attempted to answer them by transforming the natural language description into formal queries with hand-crafted rules, whereas recent work (Hosseini et al., 2014) started to employ learning techniques. However, none of these methods is suitable for history questions, which require extensive background knowledge. The same holds for the Aristo Challenge (Clark, 2015), which focuses on Elementary Grade Tests for 6-11 year olds. The Todai Robot Project (Fujita et al., 2014) aims to build a system that can pass the University of Tokyo's entrance examination. As parts of this project, one work mainly focuses on addressing yes-no questions by determining the correctness of the original proposition, and another mainly focuses on recognizing textual entailment between a description in Wikipedia and each option of a question. However, these two methods are designed separately for different kinds of questions and neither of them introduces a neural network approach.
It is inevitable to compare GKHMC with factoid questions. (Berant and Liang, 2014) treats question answering as a kind of semantic parsing, which cannot handle the specific expressions that involve a lot of background knowledge. Although (Yih et al., 2015) employed a knowledge base, it still fails on multi-sentence questions, which are beyond the scope of semantic parsing. Moreover, the diversity of candidates in GKHMC makes these models fail to match the question with the right candidate. Another task that cannot be neglected is machine comprehension, also called reading comprehension. Although in several different datasets (Smith et al., 2008; Richardson et al., 2013) questions are open-domain and candidates may be entities or sentences, understanding these questions does not require as much background knowledge as in GKHMC, and these models cannot handle the joint inference between the background knowledge and the words in questions.
We are not the first to take up the Gaokao challenge, but the former information retrieval approach does not fit some of the questions in GKHMC and the resources in that system are limited. In contrast, we introduce two different approaches to this task, compare their performance on different types of questions, combine them and obtain a state-of-the-art result.

Conclusion and Future Work
In this work, we detail the multiple-choice questions of the History subject in Gaokao, present two different approaches to address them and compare the performance of these approaches on all categories of questions. We find that the IR approach is more effective on EQs because the words in these questions are usually a description of the right answer, whereas the NN approach performs much better on SQs, possibly because neural network models can find the semantic relationship between questions and candidates. When combining them, we obtain state-of-the-art performance on GKHMC, better than that of either individual approach. This suggests that combining different approaches may be a better way to deal with real-world questions.
In future work, we will explore whether the key-value memory network proposed by (Miller et al., 2016) can help improve the performance of PPMN, what content in textbooks or encyclopedias should be stored in the permanent memory, how to mathematically organize the permanent memory so that it can be reasoned over, and whether transforming knowledge described in natural language into a formal representation is beneficial. As a long-term goal, it is necessary to introduce discourse analysis and semantic parsing to help the model truly understand the material sentences, questions and candidates.