Machine Comprehension using Rich Semantic Representations

Machine comprehension tests the system's ability to understand a piece of text through a reading comprehension task. For this task, we propose an approach using the Abstract Meaning Representation (AMR) formalism. We construct meaning representation graphs for the given text and for each question-answer pair by merging the AMRs of comprising sentences using cross-sentential phenomena such as coreference and rhetorical structures. Then, we reduce machine comprehension to a graph containment problem. We posit that there is a latent mapping of the question-answer meaning representation graph onto the text meaning representation graph that explains the answer. We present a unified max-margin framework that learns to find this mapping (given a corpus of texts and question-answer pairs), and uses what it learns to answer questions on novel texts. We show that this approach leads to state-of-the-art results on the task.


Introduction
Learning to efficiently represent and reason with natural language is a fundamental yet longstanding goal in NLP. This has led to a series of efforts in broad-coverage semantic representation (or "sembanking"). Recently, AMR, a new semantic representation in the standard neo-Davidsonian framework (Davidson, 1969; Parsons, 1990), has been proposed. AMRs are rooted, labeled graphs which incorporate PropBank-style semantic roles, within-sentence coreference, named entities and the notion of types, modality, negation, quantification, etc. in one framework.
In this paper, we describe an approach to use AMR for the task of machine comprehension. Machine comprehension (Richardson et al., 2013) evaluates a machine's understanding by posing a series of multiple choice reading comprehension tests. The tests are unique as the answer to each question can be found only in its associated texts, requiring us to go beyond simple lexical solutions.
Figure 1: The question and answer candidate are combined to generate a hypothesis. This hypothesis is AMR parsed to construct a hypothesis meaning representation graph after some post-processing (§ 2.1). Similar processing is done for each sentence in the passage as well. Then, a subset (not necessarily contiguous) of these sentence meaning representation graphs is found. These representation subgraphs are further merged using coreference information, resulting in a structure called the relevant text snippet graph. Finally, the hypothesis meaning representation graph is aligned to the snippet graph. The dashed red lines show node alignments, solid red lines show edge alignments, and the thick solid black arrow shows the rhetorical structure label (elaboration).
Our approach models machine comprehension as an extension to textual entailment, learning to output an answer that is best entailed by the passage. It works in two stages. First, we construct a meaning representation graph for the entire passage (§ 2.1) from the AMR graphs of comprising sentences. To do this, we account for cross-sentence linguistic phenomena such as entity and event coreference, and rhetorical structures. A similar meaning representation graph is also constructed for each question-answer pair. Once we have these graphs, the comprehension task can be reduced to a graph containment problem. We posit that there is a latent subgraph of the text meaning representation graph (called the snippet graph) and a latent alignment of the question-answer graph onto this snippet graph that entails the answer (see Figure 1 for an example). Then, we propose a unified max-margin approach (§ 2.2) that jointly learns the latent structure (subgraph selection and alignment) and the QA model. We evaluate our approach on the MCTest dataset and achieve competitive or better results than a number of previous proposals for this task.
Figure 2: The AMR parse for the hypothesis in Figure 1. The person nodes are merged to achieve the hypothesis meaning representation graph.

The Meaning Representation Graphs
We construct the meaning representation graph from the AMR graphs of individual sentences by merging identical concepts (using entity and event coreference). First, for each sentence AMR, we merge nodes corresponding to multi-word expressions and nodes headed by a date entity ("date-entity"), a named entity ("name") or a person entity ("person"). For example, the hypothesis meaning representation graph in Figure 1 was obtained by merging nodes in the AMR parse shown in Figure 2.
Next, we select the subset of sentence AMRs corresponding to sentences needed to answer the question. This step uses cross-sentential phenomena such as rhetorical structures [1] and entity/event coreference. The coreferent entity/event mentions are further merged into one node, resulting in a graph called the relevant text snippet graph. A similar process is also performed with the hypothesis sentences (generated by combining the question and answer candidate), as shown in Figure 1.
[1] Rhetorical structure theory (Mann and Thompson, 1988) tells us that sentences with discourse relations are related to each other. Previous works in QA (Jansen et al., 2014) have shown that these relations can help us answer certain kinds of questions. As an example, the "cause" relation between sentences in the text can often give cues that can help us answer "why" or "how" questions. Hence, the passage meaning representation also remembers RST relations between sentences.
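The merging of coreferent mentions into a single node can be illustrated with a minimal sketch. It assumes that each AMR graph is stored as a set of (source, relation, target) triples and that coreference clusters are already available from some external resolver; the function name, node identifiers and example graph are hypothetical and not taken from the paper.

```python
def merge_coreferent_nodes(triples, clusters):
    """Collapse every node of a coreference cluster onto one representative.

    triples:  set of (source, relation, target) edges (assumed representation)
    clusters: iterable of sets of node ids judged coreferent (assumed input)
    """
    representative = {}
    for cluster in clusters:
        members = sorted(cluster)
        for node in members:
            # The alphabetically first member stands in for the whole cluster.
            representative[node] = members[0]
    return {
        (representative.get(src, src), rel, representative.get(dst, dst))
        for src, rel, dst in triples
    }


# Toy example: two sentence graphs that both mention the same dog.
sentence_graphs = {
    ("s1.have", "arg0", "s1.Katy"),
    ("s1.have", "arg1", "s1.dog"),
    ("s2.bark", "arg0", "s2.dog"),
}
snippet_graph = merge_coreferent_nodes(sentence_graphs, [{"s1.dog", "s2.dog"}])
```

After the merge, both predicates point to the same dog node, which is what lets a hypothesis be aligned against evidence spread over several sentences.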

Max-Margin Solution
For each question $q_i \in Q$, let $t_i$ be the corresponding passage text and $A_i = \{a_{i1}, \ldots, a_{im}\}$ be the set of candidate answers to the question. Our solution casts the machine comprehension task as a textual entailment task by converting each question-answer candidate pair $(q_i, a_{ij})$ into a hypothesis statement $h_{ij}$. We use the question matching/rewriting rules described in Cucerzan and Agichtein (2005) to get the hypothesis statements. For each question $q_i$, the machine comprehension task reduces to picking the hypothesis $\hat{h}_i$ that has the highest likelihood of being entailed by the text $t_i$ among the set of hypotheses $h_i = \{h_{i1}, \ldots, h_{im}\}$ generated for the question $q_i$. Let $h_i^* \in h_i$ be the hypothesis corresponding to the correct answer.
As described, we use subgraph matching to help us model the inference. We assume that the selection of sentences to generate the relevant text snippet graph and the mapping of the hypothesis meaning representation graph onto the passage meaning representation graph are latent, and we infer them jointly along with the answer. We treat this as a structured prediction problem of ranking the hypothesis set $h_i$ such that the correct hypothesis $h_i^*$ is at the top of this ranking. We learn a scoring function $S_w(t, h, z)$ with parameter $w$ such that the score of the correct hypothesis $h_i^*$ and corresponding best latent structure $z_i^*$ is higher than the score of the other hypotheses and corresponding best latent structures. In a max-margin fashion, we require this gap to hold by a margin, and we relax the resulting constraints into a max-margin objective. If the scoring function is convex, then this objective is in concave-convex form and hence can be solved by the concave-convex programming procedure (CCCP) (Yuille and Rangarajan, 2003). We assume the scoring function to be linear: $S_w(t, h, z) = w^T \psi(t, h, z)$. Here, $\psi(t, h, z)$ is a feature map discussed later. The CCCP algorithm essentially alternates between solving for $z_i^*$, $z_{ij}$ $\forall j$ s.t. $h_{ij} \in h_i \setminus h_i^*$ and solving for $w$ to reach a local minimum. In the absence of information regarding the latent structure $z$, we pick the structure that gives the best score for a given hypothesis, i.e. $\arg\max_z S_w(t, h, z)$.
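For concreteness, one standard way to write such a latent max-margin problem is sketched below. This is an assumed form in the style of latent structural SVMs that is consistent with the description above, not necessarily the exact formulation used here; the margin term $\Delta$ (for instance, 1 for every incorrect hypothesis), the slack variables $\xi_i$ and the trade-off constant $C$ are assumptions.

```latex
% Margin constraint (assumed form), for every incorrect hypothesis h_{ij}
% and every latent structure z_{ij}:
S_w(t_i, h_i^*, z_i^*) \;\ge\; S_w(t_i, h_{ij}, z_{ij}) + \Delta(h_i^*, h_{ij}) - \xi_i,
\qquad \xi_i \ge 0

% Relaxed objective (assumed form), with
%   A_i(w) = \max_{h_{ij} \in h_i \setminus h_i^*,\, z_{ij}}
%            \bigl[ S_w(t_i, h_{ij}, z_{ij}) + \Delta(h_i^*, h_{ij}) \bigr]
%   B_i(w) = \max_{z} S_w(t_i, h_i^*, z)
\min_{w} \; \frac{1}{2} \lVert w \rVert_2^2
  + C \sum_i \max\bigl(0,\; A_i(w) - B_i(w)\bigr)

% Difference-of-convex view, using max(0, A - B) = max(A, B) - B:
\min_{w} \; \underbrace{\frac{1}{2} \lVert w \rVert_2^2
  + C \sum_i \max\bigl(A_i(w),\, B_i(w)\bigr)}_{\text{convex in } w}
  \;-\; \underbrace{C \sum_i B_i(w)}_{\text{convex in } w}
```

With a linear score, $A_i$ and $B_i$ are maxima of functions linear in $w$ and are therefore convex, so the objective is a difference of two convex functions. CCCP exploits this by fixing the latent structures that attain $B_i$ and then solving the remaining convex problem in $w$.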

Scoring Function and Inference
Now, we define the scoring function $S_w(t, h, z)$. Let the hypothesis meaning representation graph be $G' = (V', E')$. Our latent structure $z$ decomposes into the selection ($z_s$) of relevant sentences that lead to the text snippet graph $G = (V, E)$, and the mapping ($z_m$) of every node and edge in $G'$ onto $G$. We define the score such that it factorizes over the nodes and edges in $G'$. The weight vector $w$ has three components $w_s$, $w_v$ and $w_e$, corresponding to the relevant sentence selection, node matches and edge matches respectively. An edge in the graph is represented as a triple $(v_1, r, v_2)$ consisting of the endpoint vertices and the relation $r$.
In this score, $t$ is the text corresponding to the hypothesis $h$, and $f$ are parts of the feature map $\psi$ to be described later. $z(v')$ maps a node $v' \in V'$ to a node in $V$. Similarly, $z(e')$ maps an edge $e' \in E'$ to an edge in $E$.
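The factorization described above can be written out as follows. This is a reconstruction from the surrounding definitions: the sentence-selection term matches the expression maximized in the first inference step, while the exact arguments of the node and edge feature functions are an assumption.

```latex
S_w(t, h, z) \;=\; w_s^T f(G', G, t, h, z_s)
  \;+\; \sum_{v' \in V'} w_v^T f\bigl(v', z(v')\bigr)
  \;+\; \sum_{e' \in E'} w_e^T f\bigl(e', z(e')\bigr)
```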
Next, we describe the inference procedure, i.e. how to select the structure that gives the best score for a given hypothesis. The inference is performed in two steps. The first step selects the relevant sentences from the text. This is done by simply maximizing the first part of the score: $z_s = \arg\max_{z_s} w_s^T f(G', G, t, h, z_s)$. Here, we only consider subsets of 1, 2 and 3 sentences, as most questions can be answered by 3 sentences in the passage. The second step is formulated as an integer linear program (ILP) by rewriting the node and edge parts of the scoring function in terms of binary alignment variables. With some abuse of notation, $z_{v',v}$ and $z_{e',e}$ are binary integers such that $z_{v',v} = 1$ iff $z$ maps $v'$ onto $v$, else $z_{v',v} = 0$. Similarly, $z_{e',e} = 1$ iff $z$ maps $e'$ onto $e$, else $z_{e',e} = 0$. Additionally, we have the following constraints in our ILP:
• Each node $v' \in V'$ (or each edge $e' \in E'$) is mapped to exactly one node $v \in V$ (or one edge $e \in E$). Hence: $\sum_{v \in V} z_{v',v} = 1 \ \forall v'$ and $\sum_{e \in E} z_{e',e} = 1 \ \forall e'$.
• If an edge $e' \in E'$ is mapped to an edge $e \in E$, then the vertices $(v^1_{e'}, v^2_{e'})$ that form the end points of $e'$ must also be aligned to the vertices $(v^1_e, v^2_e)$ that form the end points of $e$. Here, we note that AMR parses also have inverse relations such as "arg0-of". Hence, we resolve this with a slight modification. If neither or both relations (corresponding to edges $e'$ and $e$) are inverse relations (case 1), we enforce that $v^1_{e'}$ align with $v^1_e$ and $v^2_{e'}$ align with $v^2_e$. If exactly one of the relations is an inverse relation (case 2), we enforce that $v^1_{e'}$ align with $v^2_e$ and $v^2_{e'}$ align with $v^1_e$. Both cases are encoded as linear constraints on the alignment variables.
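The alignment constraints can be made concrete with a small sketch. Instead of an ILP solver, it enumerates all node mappings by brute force, which is only feasible for toy graphs and is a swapped-in technique for illustration. The triple representation, the function names, the toy scoring functions and the rule that a relation ending in "-of" is an inverse are all assumptions (a few AMR roles such as "consist-of" end in "-of" without being inverses).

```python
from itertools import product


def is_inverse(rel):
    # Simplifying assumption: AMR inverse roles carry an "-of" suffix.
    return rel.endswith("-of")


def endpoints_consistent(hyp_edge, txt_edge, node_map):
    (h1, h_rel, h2), (t1, t_rel, t2) = hyp_edge, txt_edge
    if is_inverse(h_rel) == is_inverse(t_rel):
        # Case 1: neither or both relations are inverse, align in order.
        return node_map[h1] == t1 and node_map[h2] == t2
    # Case 2: exactly one relation is inverse, align crosswise.
    return node_map[h1] == t2 and node_map[h2] == t1


def best_alignment(hyp_nodes, hyp_edges, txt_nodes, txt_edges,
                   node_score, edge_score):
    """Exhaustive search over node maps; each hypothesis node and edge is
    mapped to exactly one text node and edge, as the constraints require."""
    best = (float("-inf"), None, None)
    for targets in product(txt_nodes, repeat=len(hyp_nodes)):
        node_map = dict(zip(hyp_nodes, targets))
        total = sum(node_score(v, node_map[v]) for v in hyp_nodes)
        edge_map, feasible = {}, True
        for he in hyp_edges:
            options = [te for te in txt_edges
                       if endpoints_consistent(he, te, node_map)]
            if not options:  # this edge cannot be mapped under node_map
                feasible = False
                break
            edge_map[he] = max(options, key=lambda te: edge_score(he, te))
            total += edge_score(he, edge_map[he])
        if feasible and total > best[0]:
            best = (total, node_map, edge_map)
    return best


def base_relation(rel):
    return rel[:-3] if rel.endswith("-of") else rel


# Toy graphs: "dog :poss Katy" versus the inverse form "Katy :poss-of dog".
hyp_nodes = ["h_dog", "h_katy"]
hyp_edges = [("h_dog", "poss", "h_katy")]
txt_nodes = ["dog", "katy", "sammy"]
txt_edges = [("katy", "poss-of", "dog"), ("dog", "name", "sammy")]

score, node_map, edge_map = best_alignment(
    hyp_nodes, hyp_edges, txt_nodes, txt_edges,
    node_score=lambda hv, tv: 1.0 if hv[2:] == tv else 0.0,
    edge_score=lambda he, te: 1.0 if base_relation(he[1]) == base_relation(te[1]) else 0.0,
)
```

Because the score factorizes over nodes and edges, once a node mapping is fixed each hypothesis edge can pick its best consistent text edge independently. An ILP solver reaches the same optimum without enumerating every node mapping.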

Features
Our feature function ψ(t, h, z) decomposes into three parts, each corresponding to a part of the latent structure.
The first part corresponds to relevant sentence selection. Here, we include features for matching local neighborhoods in the sentence subset and the hypothesis (features for matching bigrams, trigrams, dependencies, semantic roles and predicate-argument structure) as well as the global syntactic structure (a graph kernel for matching AMR graphs of entire sentences (Srivastava and Hovy, 2013)). Before computing the graph kernel, we reverse all inverse relation edges in the AMR graph. Note that if a sentence subset contains the answer to the question, it should intuitively be similar to the question as well as to the answer. Hence, we add features that are the element-wise product of features for the subset-question match and subset-answer match. In addition to features for the exact word/phrase match of the snippet and the hypothesis, we also add features using two paraphrase databases: ParaPara (Chan et al., 2011) and DIRT (Lin and Pantel, 2001). These databases contain paraphrase rules of the form string 1 → string 2. ParaPara rules were extracted through bilingual pivoting and DIRT rules were extracted using the distributional hypothesis. Whenever we have a substring in the text snippet that can be transformed into another using either of these two databases, we keep match features for the substring with the higher score (according to the current w) and ignore the other substring. Finally, we also have features corresponding to the RST (Mann and Thompson, 1988) links to enable inference across sentences. RST tells us that sentences with discourse relations are related to each other and can help us answer certain kinds of questions (Jansen et al., 2014). For example, the "cause" relation between sentences in the text can often give cues that can help us answer "why" or "how" questions. Hence, we have additional features: the conjunction of the rhetorical structure label from an RST parser and the question word.
The second part corresponds to node matches. Here, we have features for (a) surface-form match (edit distance), and (b) semantic word match (cosine similarity using SENNA word vectors (Collobert et al., 2011), and "Antonymy", "Class-Inclusion" or "Is-A" relations using WordNet).
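A minimal sketch of the two numeric node-match features follows. The length normalization of the edit distance, the function names and the toy two-dimensional vectors are assumptions made for illustration; real SENNA embeddings and the WordNet relation indicators are not reproduced here.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def node_features(word_a, word_b, vectors):
    """Return [surface similarity, embedding similarity] for two node labels."""
    dist = edit_distance(word_a, word_b)
    # Normalizing by the longer label is an assumption, not from the paper.
    surface = 1.0 - dist / max(len(word_a), len(word_b), 1)
    semantic = cosine(vectors[word_a], vectors[word_b])
    return [surface, semantic]


toy_vectors = {"dog": [1.0, 0.0], "dogs": [1.0, 0.0]}  # hypothetical embeddings
features = node_features("dog", "dogs", toy_vectors)
```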
The third part corresponds to edge matches. Let the edges be $e = (v_1, r, v_2)$ and $e' = (v'_1, r', v'_2)$ for notational convenience. Here, we introduce two features based on the relations: an indicator that the two relations are the same or inverses of each other, and an indicator that the two relations are in the same relation category (categories as described in Banarescu et al. (2013)). Then, we introduce a number of features based on distributional representations of the node pairs. We compute three vertex vector compositions (sum, difference and product) of the nodes for each edge, as proposed in recent representation learning literature in NLP (Mitchell and Lapata, 2008; Mikolov et al., 2013), i.e. $v_1 \odot v_2$ and $v'_1 \odot v'_2$ for $\odot \in \{+, -, \times\}$. Then, we compute the cosine similarities of the resulting compositions, producing three features. Finally, we introduce features based on the structured distributional semantic representation (Erk and Padó, 2008; Baroni and Lenci, 2010; Goyal et al., 2013), which takes the relations into account while performing the composition. Here, we use a large text corpus (in our experiments, the English Wikipedia) and construct a $|V| \times |V|$ representation matrix $M^{(r)}$ for every relation $r$ (here $V$ is the vocabulary), where the $ij$-th element $M^{(r)}_{ij}$ has the value $\log(1 + x)$ and $x$ is the frequency with which the $i$-th and $j$-th vocabulary items are in relation $r$ in the corpus. This allows us to compose the node and relation representations and compare them. Here, we compute the cosine similarity of the compositions $(v_1)^T M^{(r)}$ and $(v'_1)^T M^{(r')}$, the compositions $M^{(r)} v_2$ and $M^{(r')} v'_2$, and their respective sums $(v_1)^T M^{(r)} + M^{(r)} v_2$ and $(v'_1)^T M^{(r')} + M^{(r')} v'_2$ to get three more features.
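The distributional edge features can be sketched in a few lines of plain Python. The function names are hypothetical, vectors are plain lists, and the structured features assume that node vectors live in the same vocabulary-sized space as the relation matrices; none of these details are specified above.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def compose(a, b, op):
    ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y, "*": lambda x, y: x * y}
    return [ops[op](x, y) for x, y in zip(a, b)]


def composition_features(v1, v2, v1p, v2p):
    """Cosine similarity of sum, difference and product compositions."""
    return [cosine(compose(v1, v2, op), compose(v1p, v2p, op)) for op in "+-*"]


def relation_matrix(counts, vocab_size):
    """M[i][j] = log(1 + x), x = corpus count of items i and j in relation r."""
    matrix = [[0.0] * vocab_size for _ in range(vocab_size)]
    for (i, j), x in counts.items():
        matrix[i][j] = math.log(1 + x)
    return matrix


def vec_mat(v, m):  # row vector times matrix
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def mat_vec(m, v):  # matrix times column vector
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]


def structured_features(v1, m_r, v2, v1p, m_rp, v2p):
    left, left_p = vec_mat(v1, m_r), vec_mat(v1p, m_rp)
    right, right_p = mat_vec(m_r, v2), mat_vec(m_rp, v2p)
    both = [a + b for a, b in zip(left, right)]
    both_p = [a + b for a, b in zip(left_p, right_p)]
    return [cosine(left, left_p), cosine(right, right_p), cosine(both, both_p)]


# Toy example with a two-word vocabulary and one-hot node vectors.
m_poss = relation_matrix({(0, 1): 3}, vocab_size=2)
comp = composition_features([1.0, 2.0], [0.0, 1.0], [1.0, 2.0], [0.0, 1.0])
struct = structured_features([1.0, 0.0], m_poss, [0.0, 1.0],
                             [1.0, 0.0], m_poss, [0.0, 1.0])
```

Comparing an edge with itself, as in the toy call, yields similarities of 1 up to floating-point error; in practice the two edges come from the hypothesis graph and the snippet graph.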

Negation and Multi-task Learning
Next, we borrow two ideas from Sachan et al. (2015), namely negation and multi-task learning, treating different question types in the machine comprehension setup as different tasks.
Handling negation is important for our model as facts align well with their negated versions. We use a simple heuristic. During training, if we detect negation (using a set of simple rules that test for the presence of negation words such as "not" and "n't"), we flip the corresponding constraint, now requiring the correct hypothesis to be ranked below all the incorrect ones. During the test phase, if we detect negation, we predict the answer corresponding to the hypothesis with the lowest score.
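The test-time half of this heuristic can be sketched as follows. The marker list (beyond "not" and "n't"), the tokenization and the function names are assumptions, and the sketch simply takes whatever text negation is to be detected in, since the exact rules are not spelled out above.

```python
NEGATION_MARKERS = {"not", "n't", "never"}  # "never" is an illustrative addition


def has_negation(text):
    # Split off the clitic so that "didn't" yields the token "n't".
    tokens = text.lower().replace("n't", " n't ").split()
    return any(tok.strip(".,?!") in NEGATION_MARKERS for tok in tokens)


def predict(text, scores):
    """Return the index of the chosen hypothesis given one score per hypothesis.

    With negation detected the lowest-scoring hypothesis is chosen,
    otherwise the highest-scoring one.
    """
    pick = min if has_negation(text) else max
    return pick(range(len(scores)), key=scores.__getitem__)


scores = [0.9, 0.1, 0.5, 0.3]  # hypothetical scores for four answer candidates
plain_choice = predict("What did Sammy eat?", scores)
negated_choice = predict("What did Sammy not eat?", scores)
```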
QA systems often include a question classification component that divides the questions into semantic categories based on the type of the question or answers expected. This allows the model to learn question-type-specific parameters when needed. We experiment with three task classifications proposed by Sachan et al. (2015). The first is QClassification, which classifies the question based on the question word (what, why, etc.). Next is the QAClassification scheme, which classifies questions into different semantic classes based on the possible semantic types of the answers sought. The third scheme, TaskClassification, classifies the questions into one of 20 subtasks for machine comprehension proposed in Weston et al. (2015). We point the reader to Sachan et al. (2015) for details on the multi-task model.

Experiments
Datasets: We use the MCTest-500 dataset (Richardson et al., 2013), a freely available set of 500 stories (300 train, 50 dev and 150 test) and associated questions, to evaluate our model. Each story in MCTest has four multiple-choice questions, each with four answer choices. Each question has exactly one correct answer. Each question is also annotated as 'single' or 'multiple'. The questions annotated 'single' require just one sentence in the passage to answer them. For 'multiple' questions it should not be possible to find the answer to the question with just one sentence of the passage. In a sense, 'multiple' questions are harder than 'single' questions as they require more complex inference. We also present the results breakdown for 'single' and 'multiple' category questions.
Baselines: We compare our approach to the following baselines: (1-3) The first three baselines are taken from Richardson et al. (2013). SW and SW+D use a sliding window and match a bag of words constructed from the question and the candidate answer to the text. RTE uses textual entailment by selecting the hypothesis that has the highest likelihood of being entailed by the passage. (4) LEX++, taken from Smith et al. (2015), is another lexical matching method that takes into account multiple context windows, question types and coreference. (5) JACANA uses an off-the-shelf aligner and aligns the hypothesis statement with the passage. (6-7) LSTM and QANTA, taken from Sachan et al. (2015), use neural networks (LSTMs and recursive NNs, respectively). (8) ATTENTION, taken from Yin et al. (2016), uses an attention-based convolutional neural network. (9) DISCOURSE, taken from Narasimhan and Barzilay (2015), is a discourse-based model.
(10-14) LSSVM, LSSVM+Negation and LSSVM+Negation (MultiTask), taken from Sachan et al. (2015), are all discourse-aware latent structural SVM models. LSSVM+Negation accounts for negation. LSSVM+Negation+MTL further incorporates multi-task learning based on question types. Here, we have three variants of multi-task learners based on the three question classification strategies. (15) Finally, SYN+FRM+SEM, taken from Wang et al. (2015), proposes a framework with features based on syntax, frame semantics, coreference and word embeddings.
Results: We compare our AMR subgraph containment approach, with and without our modifications for negation and multi-task learning, against the baselines in Table 1. We can observe that our models have a comparable performance to all the baselines, including the neural network approaches and all previous approaches proposed for this task. Further, when we incorporate multi-task learning, our approach achieves the state of the art. Also, our approaches show a considerable improvement over the baselines for 'multiple' questions. This shows the benefit of our latent structure, which allows us to combine evidence from multiple sentences. The negation heuristic helps significantly, especially for 'single' questions (the majority of negation cases in the MCTest dataset are for the 'single' questions). The multi-task method which performs a classification based on the subtasks for machine comprehension defined in Weston et al. (2015) does better than QAClassification, which learns the question-answer classification. QAClassification in turn performs better than QClassification, which learns the question classification only. These results, together, provide validation for our approach of subgraph matching over meaning representation graphs, and the incorporation of negation and multi-task learning.
All differences between the baselines (except SYN+FRM+SEM) and our approaches, and the improvements due to negation and multi-task learning, are significant (p < 0.05) using the two-tailed paired t-test.

Conclusion
We proposed a solution for reading comprehension tests using AMR. Our solution builds intermediate meaning representations for the passage and the question-answer pairs. Then it poses the comprehension task as a subgraph matching task by learning latent alignments from one meaning representation to another. Our approach achieves competitive or better performance than other approaches proposed for this task. Incorporation of negation and multi-task learning leads to further improvements, establishing it as the new state of the art.