Attention-Based Convolutional Neural Network for Machine Comprehension

Understanding open-domain text is one of the primary challenges in natural language processing (NLP). Machine comprehension benchmarks evaluate a system's ability to understand text based on the text content only. In this work, we investigate machine comprehension on MCTest, a question answering (QA) benchmark. Prior work is mainly based on feature engineering approaches. We propose a neural network framework, named hierarchical attention-based convolutional neural network (HABCNN), to address this task without any manually designed features. Specifically, we explore HABCNN for this task along two routes: one through traditional joint modeling of passage, question and answer, the other through textual entailment. HABCNN employs an attention mechanism to detect key phrases, key sentences and key snippets that are relevant to answering the question. Experiments show that HABCNN outperforms prior deep learning approaches by a big margin.


Introduction
Endowing machines with the ability to understand natural language is a long-standing goal in NLP and holds the promise of revolutionizing the way in which people interact with machines and retrieve information. Richardson et al. [2013] proposed the task of machine comprehension, along with MCTest, a question answering dataset for evaluation. The ability of the machine to understand text is evaluated by posing a series of questions, where the answer to each question can be found only in the associated text. Solutions typically focus on some semantic interpretation of the text, possibly with some form of probabilistic or logic inference, to answer the question. Despite intensive recent work [Weston et al., 2014; Weston et al., 2015; Hermann et al., 2015; Sachan et al., 2015], the problem is far from solved.
Machine comprehension is an open-domain question-answering problem that contains factoid questions, but the answers can be derived by extraction or induction of key clues. Figure 1 shows one example in MCTest. Each example consists of one document and four associated questions; each question is followed by four answer candidates, of which only one is correct. Questions in MCTest have two categories: "one" and "multiple". The label indicates whether one or multiple sentences from the document are required to answer the question. To correctly answer the first question in the example, the two blue sentences are required; for the second question, only the red sentence can help. The following observations hold for the whole MCTest. (i) Most of the sentences in the document are irrelevant for a given question. This suggests that we need to pay attention to just some key regions. (ii) The answer candidates can be flexible in length and abstraction level, and may not appear in the document. For example, candidate B for the second question is "outside", which is one word and does not exist in the document, while the answer candidates for the first question are longer texts with auxiliary words like "Because". This requires our system to handle flexible texts via extraction as well as abstraction. (iii) Some questions require multiple sentences to infer the answer, and those vital sentences mostly appear close to each other (we call such a group a snippet). Hence, our system should be able to make a choice or compromise between a potential single-sentence clue and a snippet clue.
Figure 1: One example with 2 out of 4 questions in the MCTest. "*" marks the correct answer.
Prior work on this task is mostly based on feature engineering. This work, instead, takes the lead in presenting a deep neural network based approach without any linguistic features involved.
Concretely, we propose HABCNN, a hierarchical attention-based convolutional neural network, to address this task along two routes. In the first, we project the document in two different ways, one based on question-attention and one based on answer-attention, and then compare the two projected document representations to determine whether the answer matches the question. In the second, every question-answer pair is reformatted into a statement, and the whole task is then treated as textual entailment.
In both routes, convolutional neural networks (CNNs) are used to model all types of text. As human beings usually do for such a QA task, our model is expected to detect the key snippets, key sentences, and key words or phrases in the document. To detect the informative parts required by a question, we explore an attention mechanism to model the document so that its representation concentrates on the required information. In practice, instead of imitating human beings top-down, our system models the document bottom-up, accumulating the most relevant information from word level to snippet level.
Our approach is novel in three aspects. (i) A document is modeled by a hierarchical CNN at different granularities, from word to sentence level, then from sentence to snippet level. We choose a CNN rather than other sequence models such as recurrent neural networks [Mikolov et al., 2010], long short-term memory units (LSTM [Hochreiter and Schmidhuber, 1997]) or gated recurrent units (GRU [Cho et al., 2014]) because we argue that CNNs are more suitable for detecting the key sentences within documents and key phrases within sentences. Considering again the second question in Figure 1, the original sentence "They sat by the fire and talked about the insects" has more information than required, i.e., we do not need to know that "they talked about the insects". Sequence modeling neural networks usually model the sentence meaning by accumulating the whole sequence. CNNs, with convolution-pooling steps, are supposed to detect prominent features no matter where the features come from. (ii) In the example in Figure 1, not all sentences are required for a given question, and different snippets are usually required by different questions. Hence, the same document should have different representations depending on the question. To this end, attention is incorporated into the hierarchical CNN to guide the learning of dynamic document representations that closely match the information requirements of questions. (iii) Document representations at sentence and snippet levels are both informative for the question, so a highway network is developed to combine them, enabling our system to make a flexible tradeoff.
Overall, we make three contributions. (i) We present a hierarchical attention-based CNN system, "HABCNN". It is, to our knowledge, the first deep learning based system for the MCTest task. (ii) Prior document modeling systems based on deep neural networks mostly generate generic representations; this work is the first to incorporate attention so that the document representation is biased towards the question requirement. (iii) Our HABCNN systems outperform other deep learning competitors by big margins.

Related Work
Existing systems for the MCTest task are mostly based on manually engineered features. Representative work includes [Narasimhan and Barzilay, 2015; Sachan et al., 2015; Wang and McAllester, 2015; Smith et al., 2015]. In these works, a common route is first to define a regularized loss function based on assumed feature vectors; the effort then focuses on designing effective features based on various rules. Even though this research is groundbreaking for the task, its flexibility and capacity for generalization are limited.
Deep learning based approaches are attracting increasing interest in analogous tasks. Weston et al. [2014] introduce memory networks for factoid QA. The memory network framework is extended in [Weston et al., 2015; Kumar et al., 2015] for the Facebook bAbI dataset. The Neural Reasoner of Peng et al. [2015] infers over multiple supporting facts to generate an entity answer for a given question and is also tested on bAbI. All of these works deal with short texts with simple grammar, aiming to generate an answer that is restricted to one word denoting a location, a person, etc. Some works also address other kinds of QA tasks. For example, Iyyer et al. [2014] present QANTA, a recursive neural network, to infer an entity based on its description text. This task is basically a matching between description and entity; no explicit question exists. Another difference from our work is that all the sentences in the entity description contain partial information about the entity, hence a description is supposed to have only one representation. In our task, however, the modeling of the document should change dynamically according to the question. Hermann et al. [2015] incorporate an attention mechanism into an LSTM for a QA task over news text. Still, their work does not handle complex question types like "Why..."; they merely aim to find the entity in the document that fills the slot in the query so that the completed query is true based on the document. Nevertheless, it inspires us to treat our task as a textual entailment problem by first reformatting question-answer pairs into statements.
Some other deep learning systems are developed for the answer selection task [Yu et al., 2014; Yang et al., 2015; Severyn and Moschitti, 2015; Shen et al., 2015; Wang et al., 2010]. In contrast to our task, this kind of question answering does not involve document comprehension: these systems only try to match the question and answer candidate without any background information. Instead, we treat machine comprehension in this work as a question-answer matching problem under background guidance.
Overall, for the open-domain MCTest machine comprehension task, this work is the first to resort to deep neural networks.

Model
We investigate this task with three approaches, illustrated in Figure 2. (i) We can compute two different document (D) representations in a common space, one based on question (Q) attention and one based on answer (A) attention, and compare them. We name this architecture HABCNN-QAP. (ii) We compute a representation of D based on Q attention (as before), but now we compare it directly with a representation of A. We name this architecture HABCNN-QP. (iii) We treat this QA task as textual entailment (TE), first reformatting the Q-A pair into a statement (S), then matching S and D directly. We name this architecture HABCNN-TE. All three approaches are implemented in the common framework HABCNN.

HABCNN
Recall that we use the abbreviations A (answer), Q (question), S (statement), D (document). HABCNN performs representation learning for the triple (Q, A, D) in HABCNN-QP and HABCNN-QAP, and for the tuple (S, D) in HABCNN-TE. For convenience, we use "query" to refer to Q, A, or S uniformly. HABCNN, depicted in Figure 3, has the following phases.
Input Layer. The input is (query, D). The query is two individual sentences (for Q, A) or one single sentence (for S); D is a sequence of sentences. Words are initialized by d-dimensional pre-trained word embeddings. As a result, each sentence is represented as a feature map with dimensionality d × s (s is the sentence length). In Figure 3, each sentence in the input layer is depicted by a rectangle with multiple columns.
Sentence-CNN. Sentence-CNN is used for sentence representation learning from word level. Given a sentence of length s with a word sequence $v_1, v_2, \ldots, v_s$, let vector $c_i \in \mathbb{R}^{wd}$ be the concatenated embeddings of w words $v_{i-w+1}, \ldots, v_i$, where w is the filter width, d is the dimensionality of word representations and $0 < i < s + w$. Embeddings for words $v_i$ with $i < 1$ or $i > s$ are zero padding. We then generate the representation $p_i \in \mathbb{R}^{d_1}$ for the phrase $v_{i-w+1}, \ldots, v_i$ using the convolution weights $W \in \mathbb{R}^{d_1 \times wd}$:
$$p_i = \tanh(W \cdot c_i + b) \qquad (1)$$
where $b \in \mathbb{R}^{d_1}$ is the bias. $d_1$ is called "kernel size" in CNN.
Note that the sentence-CNNs for query and all document sentences share the same weights, so that the resulting sentence representations are comparable.
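As an illustration of the sentence-CNN step, the following minimal Python sketch computes phrase representations from zero-padded windows of w word embeddings, using one weight matrix that can be shared by the query and all document sentences. The function name `sentence_cnn`, the plain-list data layout and the tanh nonlinearity are assumptions made for this sketch; it is not a description of the actual implementation.

```python
import math


def sentence_cnn(embeddings, W, b, w):
    """Wide convolution over one sentence (illustrative sketch).

    embeddings: list of s word vectors, each of length d
    W: list of d1 rows, each of length w * d (shared across sentences)
    b: bias vector of length d1
    w: filter width
    Returns s + w - 1 phrase vectors, each of length d1.
    """
    s = len(embeddings)
    d = len(embeddings[0])
    zero = [0.0] * d
    # Pad w - 1 zero vectors on both sides so that every word is covered
    # by w windows, matching 0 < i < s + w in the text.
    padded = [zero] * (w - 1) + list(embeddings) + [zero] * (w - 1)
    phrases = []
    for start in range(s + w - 1):
        # c_i: concatenation of w consecutive word embeddings.
        c = [x for vec in padded[start:start + w] for x in vec]
        # tanh is an assumed nonlinearity for this sketch.
        p = [math.tanh(sum(wj * cj for wj, cj in zip(row, c)) + bj)
             for row, bj in zip(W, b)]
        phrases.append(p)
    return phrases
```

Calling the same function with the same `W` and `b` for the query and for every sentence of D keeps the resulting representations in one space, which is what makes them comparable.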
Sentence-Level Representation. The sentence-CNN generates a new feature map (omitted in Figure 3) for each input sentence; one column in the feature map denotes a phrase representation (i.e., $p_i$ in Equation (1)).
For the query and each sentence of D, we do element-wise 1-max-pooling ("max-pooling" for short) [Collobert and Weston, 2008] over phrase representations to form their representations at this level.
We now treat D as a set of "vital" sentences and "noise" sentences.
We propose attention-pooling to learn the sentence-level representation of D as follows: first identify vital sentences by computing the attention for each sentence of D as the cosine similarity between its representation and the query representation, then select the k highest-attention sentences and do max-pooling over them. Taking Figure 3 as an example, based on the output of the sentence-CNN layer, the k = 2 important sentences (in blue) are combined by max-pooling into the sentence-level representation $v_s$ of D; the other sentence representations (in white) are neglected as they have low attention. (If k = all, attention-pooling reduces to the common max-pooling of [Collobert and Weston, 2008].) When the query is (Q, A), this step is repeated, once for Q and once for A, to compute representations of D at the sentence level that are biased with respect to Q and A, respectively.
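The attention-pooling procedure can be sketched in a few lines of Python. The helper names `cosine`, `max_pool` and `attention_pool` are hypothetical, and vectors are plain lists of floats; the sketch only mirrors the three steps described above (cosine attention, top-k selection, element-wise max-pooling).

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def max_pool(vectors):
    """Element-wise maximum over a list of equally sized vectors."""
    return [max(column) for column in zip(*vectors)]


def attention_pool(units, query, k):
    """Keep the k units most similar to the query and max-pool them.

    units: sentence (or snippet) representations of the document
    query: representation of Q, A or S
    Returns the pooled vector and the attention value of every unit.
    """
    scores = [cosine(u, query) for u in units]
    top = sorted(range(len(units)), key=lambda i: scores[i], reverse=True)[:k]
    return max_pool([units[i] for i in top]), scores
```

With k equal to the number of units, `attention_pool` degenerates to ordinary max-pooling over all units, as noted in the text.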
Snippet-CNN. As the example in Figure 1 shows, to answer the first question "why did Grandpa answer the door?", it does not suffice to compare this question only to the sentence "Grandpa answered the door with a smile and welcomed Jimmy inside"; instead, the snippet "Finally, Jimmy arrived at Grandpa's house and knocked. Grandpa answered the door with a smile and welcomed Jimmy inside" should be used for the comparison. To this end, it is necessary to stack another CNN layer, snippet-CNN, to learn representations of snippets, i.e., units of one or more sentences. Thus, the basic units input to snippet-CNN (resp. sentence-CNN) are sentences (resp. words) and the output is representations of snippets (resp. sentences).
Concretely, snippet-CNN puts all sentence representations in column sequence as a feature map and conducts another convolution operation over it. With filter width w, this step generates representations of snippets of w consecutive sentences. Similarly, we use the same CNN to learn higher-abstraction query representations (just treating the query as a document with only one sentence, so that the higher-abstraction query representation is in the same space as the corresponding snippet representations).
Snippet-Level Representation. For the output of snippet-CNN, each representation is more abstract and denotes a bigger granularity. We apply the same attention-pooling process to snippet-level representations: attention values are computed as cosine similarities between query and snippets, and the snippets with the k largest attentions are retained. Max-pooling over the k selected snippet representations then creates the snippet-level representation $v_t$ of D. Two selected snippets are shown in red in Figure 3.
Overall Representation. Based on convolution layers at two different granularities, we have derived query-biased representations of D at sentence level (i.e., $v_s$) as well as snippet level (i.e., $v_t$). In order to create a flexible choice for open Q/A, we develop a highway network [Srivastava et al., 2015] to combine the two levels of representations as an overall representation $v_o$ of D:
$$v_o = (1 - h) \odot v_s + h \odot v_t \qquad (2)$$
where the highway network weights h are learned by
$$h = \sigma(W_h v_s + b) \qquad (3)$$
where $W_h \in \mathbb{R}^{d_1 \times d_1}$. With the same highway network, we can generate the overall query representation, $r_i$ in Figure 3, by combining the two representations of the query at sentence and snippet levels.
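To make the combination step concrete, here is a minimal sketch of a highway-style gate in Python. It assumes a sigmoid gate computed from the sentence-level vector and a bias vector `b_h`; the function name, the bias and the choice of gate input are assumptions of this sketch rather than confirmed details of the system.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def highway_combine(v_s, v_t, W_h, b_h):
    """Blend sentence-level v_s and snippet-level v_t with a learned gate.

    W_h: d1 x d1 weight matrix (list of rows), b_h: bias of length d1.
    The gate h lies in (0, 1) per dimension; h near 0 keeps v_s,
    h near 1 keeps v_t.
    """
    gate = [sigmoid(sum(w * x for w, x in zip(row, v_s)) + b)
            for row, b in zip(W_h, b_h)]
    return [(1.0 - h) * s + h * t for h, s, t in zip(gate, v_s, v_t)]
```

Because the gate is computed per dimension, the model can rely on the single best sentence for some features and on the snippet for others, which is the flexible tradeoff the highway network is meant to provide.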

HABCNN-QP & HABCNN-QAP
HABCNN-QP/QAP computes the representation of D as a projection of D, either based on attention from Q or based on attention from A. We hope that these two projections of the document are close for a correct A and less close for an incorrect A. As stated in related work, machine comprehension can be viewed as an answer selection task using the document D as critical background information. Here, HABCNN-QP/QAP do not compare Q and A directly; instead, they use Q and A to filter the document differently, extracting what is critical for the Q/A match by attention-pooling. They then match the two document representations in the new space.
For ease of exposition, we have used the symbol $v_o$ so far, but in HABCNN-QP/QAP we compute two different document representations: $v_{oq}$, for which attention is computed with respect to Q, and $v_{oa}$, for which attention is computed with respect to A. $r_i$ also has two versions, one for Q ($r_{iq}$) and one for A ($r_{ia}$).
HABCNN-QP and HABCNN-QAP make different use of $v_{oq}$. HABCNN-QP compares $v_{oq}$ with the answer representation $r_{ia}$. HABCNN-QAP compares $v_{oq}$ with $v_{oa}$. HABCNN-QAP projects D twice, once based on attention from Q and once based on attention from A, and compares the two projected representations, as shown in Figure 2 (top). HABCNN-QP only utilizes the Q-based projection of D and then compares the projected D with the answer representation, as shown in Figure 2 (middle).
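The difference between the two architectures reduces to which pair of vectors is compared. The sketch below assumes that the representations have already been computed and uses cosine similarity as the matching score; `score_qap`, `score_qp` and `pick_answer` are hypothetical helper names introduced only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def score_qap(v_oq, v_oa):
    """HABCNN-QAP: compare the Q-biased and A-biased projections of D."""
    return cosine(v_oq, v_oa)


def score_qp(v_oq, r_ia):
    """HABCNN-QP: compare the Q-biased projection of D with the answer."""
    return cosine(v_oq, r_ia)


def pick_answer(scores):
    """Index of the highest-scoring answer candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])
```

At test time, one score is computed per answer candidate and the candidate with the highest score is returned.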

HABCNN-TE
HABCNN-TE treats machine comprehension as textual entailment. We use the statements that are provided as part of MCTest. Each statement corresponds to a question-answer pair; e.g., the Q/A pair "Why did Grandpa answer the door?" / "Because he saw the insects" (Figure 1) is reformatted into the statement "Grandpa answered the door because he saw the insects". The question answering task is then cast as: "does the document entail the statement?" For HABCNN-TE, shown in Figure 2 (bottom), the input for Figure 3 is the pair (S, D). HABCNN-TE tries to match the representation $r_i$ of S with the representation $v_o$ of D.

Dataset
MCTest has two subsets. MCTest-160 is a set of 160 items, each consisting of a document and four questions, each followed by one correct answer and three incorrect answers (split into 70 train, 30 dev and 60 test), and MCTest-500 is a set of 500 items (split into 300 train, 50 dev and 150 test).

Training Setup and Tricks
Our training objective is to minimize the following ranking loss function:
$$L(d, a^+, a^-) = \max(0, \alpha + S(d, a^-) - S(d, a^+))$$
where S(·, ·) is a matching score between two representation vectors. Cosine similarity is used throughout. α is a constant.
For this common ranking loss, there are two ways to utilize the data, given that each positive answer is accompanied by three negative answers. One is to treat $(d, a^+, a^-_1, a^-_2, a^-_3)$ as a training example, in which case the loss function has three "max()" terms, one for each positive-negative pair; the other is to treat $(d, a^+, a^-_i)$ as an individual training example. In practice, we find that the second way works better. We conjecture that the second way has more training examples, and positive answers are repeatedly used to balance the amounts of positive and negative answers.
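The two ways of using the data can be illustrated with a small Python sketch. It assumes the usual margin-based hinge form of a ranking loss, with `alpha` as the margin and scores already computed by the matching function S; the function names are hypothetical.

```python
def ranking_loss(score_pos, score_neg, alpha):
    """Hinge loss for one (positive, negative) pair.

    The loss is zero once the positive answer beats the negative one
    by at least the margin alpha.
    """
    return max(0.0, alpha + score_neg - score_pos)


def joint_loss(score_pos, scores_neg, alpha):
    """First style: one example with three max() terms."""
    return sum(ranking_loss(score_pos, s, alpha) for s in scores_neg)


def split_examples(d, a_pos, a_negs):
    """Second style: one training example per negative answer."""
    return [(d, a_pos, a_neg) for a_neg in a_negs]
```

With the second style, each question yields three triples that share the same positive answer, so the positive answer is seen as often as the negatives combined.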
Multitask learning: Question typing is commonly used and has proved to be very helpful in QA tasks [Sachan et al., 2015]. Inspired by this, we stack a logistic regression layer over the question representation $r_{iq}$, with the aim that this subtask benefits the parameter tuning of the whole system, so that the question is better recognized and the answer is found more accurately.
To be specific, we classify questions into 12 classes: "how", "how much", "how many", "what", "who", "where", "which", "when", "whose", "why", "will" and "other". The question label is created by querying for the label keyword in the question. If more than one keyword appears in a question, we adopt the one that appears earlier and is more specific (e.g., "how much", not "how"). If there is no match, the class "other" is assigned.
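The labeling rule can be sketched as a short Python function. Lowercasing the question and matching keywords on word boundaries are assumptions of this sketch; the text above only specifies the keyword list and the preference for earlier and more specific matches.

```python
import re

# The eleven keyword classes; "other" is the fallback class.
KEYWORDS = ["how much", "how many", "how", "what", "who", "where",
            "which", "when", "whose", "why", "will"]


def question_type(question):
    """Return the question class by keyword lookup.

    Among all matching keywords, prefer the one that starts earliest;
    ties at the same position go to the longer (more specific) keyword.
    """
    text = question.lower()
    best = None  # (start position, negative length, keyword)
    for keyword in KEYWORDS:
        match = re.search(r"\b" + re.escape(keyword) + r"\b", text)
        if match:
            key = (match.start(), -len(keyword), keyword)
            if best is None or key < best:
                best = key
    return best[2] if best else "other"
```

For example, a question beginning with "How many" matches both "how" and "how many" at position 0, and the longer keyword wins.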
We use two evaluation metrics: accuracy (proportion of questions correctly answered) and $\mathrm{NDCG}_4$ [Järvelin and Kekäläinen, 2002]. Unlike accuracy, which evaluates whether the question is correctly answered or not, $\mathrm{NDCG}_4$, being a measure of ranking quality, evaluates the position of the correct answer in our predicted ranking.
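For a single question with four candidates and one correct answer, both metrics can be computed as in the following sketch. It assumes binary relevance and the common logarithmic discount 1 / log2(rank + 1); other DCG variants exist, so this discount is an assumption of the sketch, and the function names are hypothetical.

```python
import math


def ndcg_at_k(relevances, k=4):
    """NDCG over the top-k positions of a ranked relevance list."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def evaluate(scores, correct_index):
    """Accuracy and NDCG_4 for one question.

    scores: matching scores of the answer candidates
    correct_index: position of the correct answer in `scores`
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    relevances = [1.0 if i == correct_index else 0.0 for i in order]
    accuracy = 1.0 if order[0] == correct_index else 0.0
    return accuracy, ndcg_at_k(relevances)
```

Under these assumptions a correct answer ranked second still earns 1 / log2(3) in NDCG while contributing nothing to accuracy, which is why the two metrics can diverge.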

Baseline Systems
This work focuses on the comparison with systems based on distributed representation learning and deep learning:
Addition. Directly compare question and answers without considering D. Sentence representations are computed by element-wise addition over word representations.
Addition-proj. First compute sentence representations for Q, A and all D sentences in the same way as Addition, then match the two sentences in D that have the highest similarity with Q and A, respectively.
NR. The Neural Reasoner [Peng et al., 2015] has an encoding layer, multiple reasoning layers and a final answer layer. The input for the encoding layer is a question and the sentences of the document (called facts); each sentence is encoded by a GRU into a vector. In each reasoning layer, NR lets the question representation interact with each fact representation as a reasoning process. Finally, all temporary reasoning clues are pooled as the answer representation.
AR. The Attentive Reader [Hermann et al., 2015] is implemented by modeling the whole D as a word sequence, without specific sentence or snippet representations, using an LSTM. The attention mechanism is implemented at the word representation level.
Overall, the baselines Addition and Addition-proj do not involve complex composition and inference. NR and AR represent the top-performing deep neural networks in QA tasks.
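The two addition baselines are simple enough to sketch directly. The code below assumes word vectors are given as lists of floats and uses cosine similarity for matching; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def add_words(word_vectors):
    """Sentence representation as the element-wise sum of word vectors."""
    return [sum(column) for column in zip(*word_vectors)]


def addition_score(q_words, a_words):
    """Addition: compare Q and A directly, ignoring the document."""
    return cosine(add_words(q_words), add_words(a_words))


def addition_proj_score(q_words, a_words, doc_sentences):
    """Addition-proj: compare the D sentences closest to Q and to A."""
    q, a = add_words(q_words), add_words(a_words)
    sentences = [add_words(s) for s in doc_sentences]
    best_for_q = max(sentences, key=lambda s: cosine(s, q))
    best_for_a = max(sentences, key=lambda s: cosine(s, a))
    return cosine(best_for_q, best_for_a)
```

Neither baseline has trainable composition parameters, so any gain of the CNN-based systems over them can be attributed to learned composition and attention.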

HABCNN Variants
In addition to the main architectures described above, we also explore two variants of HABCNN, inspired by [Peng et al., 2015] and [Hermann et al., 2015], respectively.
Variant-I: As RNNs are widely recognized as a competitor of CNNs in sentence modeling, similar to [Peng et al., 2015], we replace the sentence-CNN in Figure 3 by a GRU while keeping the other parts unchanged.
Variant-II: How to model attention at the granularity of words was shown in [Hermann et al., 2015]; see their paper for details. We extend their attention idea and model attention at the granularity of sentences and snippets. Our attention gives different weights to sentences/snippets (not words), then computes the document representation as a weighted average of all sentence/snippet representations.
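The weighted-average idea behind Variant-II can be sketched as follows. The dot-product scoring function is only a stand-in chosen for brevity (the attention of [Hermann et al., 2015] uses learned parameters), and the function names are hypothetical; what the sketch shows is the softmax normalization followed by a weighted average over all units.

```python
import math


def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_average_attention(units, query):
    """Soft attention over sentence or snippet representations.

    Every unit contributes to the document representation, in
    proportion to its normalized attention weight.
    """
    # Stand-in scoring function: dot product between unit and query.
    scores = [sum(a * b for a, b in zip(u, query)) for u in units]
    weights = softmax(scores)
    dim = len(units[0])
    doc = [sum(w * u[j] for w, u in zip(weights, units)) for j in range(dim)]
    return doc, weights
```

In contrast to attention-pooling, no unit is discarded here, so low-attention sentences still influence the document representation.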

Results
Table 2 lists the performance of baselines, HABCNN-TE variants and HABCNN systems in the first, second and last block, respectively (we only report variants for the top-performing HABCNN-TE). Our HABCNN systems consistently outperform all baselines, and in particular surpass the two competitive deep learning based systems AR and NR. The margin between our best-performing HABCNN-TE and NR is 15.6/16.5 (accuracy/NDCG) on MCTest-160 and 7.3/4.6 on MCTest-500. This demonstrates the promise of our architecture in this task.
As said before, both the AR and NR systems aim to generate answers in entity form. Their designs might not suit this machine comprehension task, in which the answers are openly formed by summarizing or abstracting the clues. To be more specific, AR always models D at word level, and attention is also paid to the corresponding word representations, which is applicable for entity-style answers but less suitable for comprehension at sentence level or even snippet level. NR, in contrast, always models D at sentence level, neglecting the discovery of key phrases, which however compose most of the answers. In addition, the attention of the AR system and the question-fact interaction in the NR system both bring large numbers of parameters, which potentially constrains their power on a dataset of limited size.
For Variant-I and Variant-II (second block of Table 2), we can see that both modifications harm the original HABCNN-TE performance. The first variant, i.e., replacing the sentence-CNN in Figure 3 with a GRU module, is not helpful for this task. We suspect that this lies in the fundamental functions of CNN and GRU. The CNN models a sentence without caring about the global word order information, and max-pooling is supposed to extract the features of key phrases in the sentence no matter where the phrases are located. This property should be useful for answer detection, as answers are usually formed by discovering some key phrases; not all words in a sentence should be considered. A GRU, however, models a sentence by reading the words sequentially, so the importance of phrases is less determined by the question requirement. The second variant, which uses a more complicated attention scheme to model biased D representations than the simple cosine similarity based attention used in our model, is less effective at detecting truly informative sentences or snippets. We doubt the suitability of this kind of attention scheme for long sentence sequences. In training, the attention weights after softmax normalization differ only slightly across sentences, which means the system cannot distinguish key sentences from noise sentences effectively. Our cosine similarity based attention-pooling, though very simple, is able to filter noise sentences more effectively, as we only pick the top-k pivotal sentences to form the final D representation. This trick makes the system simple yet effective.

Case Study and Error Analysis
In Figure 4, we visualize the attention distribution at sentence level as well as snippet level for the statement "Grandpa answered the door because Jimmy knocked", whose corresponding question requires multiple sentences to answer. From the left part, we can see that "Grandpa answered the door with a smile and welcomed Jimmy inside" has the highest attention weight. This meets the intuition that this sentence has semantic overlap with the statement. And yet this sentence does not contain the answer. Consider now the right part, in which the CNN layer over sentence-level representations is supposed to extract high-level features of snippets. At this level, the highest attention weight is assigned to the best snippet "Finally, Jimmy arrived...knocked. Grandpa answered the door...". The neighboring snippets also get relatively higher attention than other regions. Recall that our system chooses the one sentence with top attention in the left part and the top-3 snippets in the right part (referring to the k value in Table 1) to form D representations at different granularities, then uses a highway network to combine both representations into an overall D representation. This visualization suggests that our architecture provides a good way for a question to balance key information from different granularities.
We also do some preliminary error analysis. One big obstacle for our systems is the "how many" questions. For example, for the question "how many rooms did I say I checked?", the answer candidates are the four digits "5, 4, 3, 2", which never appear in D but require counting some locations. Such numeric answers cannot yet be modeled well by distributed representations. In addition, numeric answers also appear for "what" questions, like "what time did...". Another big limitation lies in "why" questions. This question type requires complex inference and long-distance dependencies. We observed that all deep learning systems, including the two baselines, suffered somewhat from it.

Conclusion
This work takes the lead in presenting a CNN based neural network system for the open-domain machine comprehension task. Our systems address this task via document projection as well as via textual entailment. The latter demonstrates slightly better performance. Overall, our architecture, which models dynamic document representations by an attention scheme from sentence level to snippet level, shows promising results in this task. In the future, more fine-grained representation learning approaches are expected to model complex answer types and question types.

Figure 2: Illustrations of HABCNN-QAP (top), HABCNN-QP (middle) and HABCNN-TE (bottom). Q, A, S: question, answer, statement; D: document.

Figure 3: HABCNN. Feature maps for phrase representations $p_i$ and the max-pooling steps that create sentence representations out of phrase representations are omitted for simplification. Each snippet covers three sentences in snippet-CNN. The symbol • denotes cosine similarity calculation.

Figure 4: Attention visualization for the statement "Grandpa answered the door because Jimmy knocked" in the example of Figure 1. Overlong sentences are truncated with "...". Left: attention weights for each single sentence after the sentence-CNN; right: attention weights for each snippet (two consecutive sentences, as filter width w = 2 in Table 1) after the snippet-CNN.