Machine Comprehension with Syntax, Frames, and Semantics

We demonstrate significant improvement on the MCTest question answering task (Richardson et al., 2013) by augmenting baseline features with features based on syntax, frame semantics, coreference, and word embeddings, and combining them in a max-margin learning framework. We achieve the best results we are aware of on this dataset, outperforming concurrently-published results. These results demonstrate a significant performance gradient for the use of linguistic structure in machine comprehension.


Introduction
Recent question answering (QA) systems (Ferrucci et al., 2010; Berant et al., 2013; Bordes et al., 2014) have focused on open-domain factoid questions, relying on knowledge bases like Freebase (Bollacker et al., 2008) or large corpora of unstructured text. While clearly useful, this type of QA may not be the best way to evaluate natural language understanding capability. Due to the redundancy of facts expressed on the web, many questions are answerable with shallow techniques from information extraction (Yao et al., 2014).
There is also recent work on QA based on synthetic text describing events in adventure games (Sukhbaatar et al., 2015). Synthetic text provides a cleanroom environment for evaluating QA systems, and has spurred development of powerful neural architectures for complex reasoning. However, the formulaic semantics underlying these synthetic texts allows for the construction of perfect rule-based question answering systems, and may not reflect the patterns of natural linguistic expression.
In this paper, we focus on machine comprehension, which is QA in which the answer is contained within a provided passage. Several comprehension tasks have been developed, including Remedia (Hirschman et al., 1999), CBC4kids (Breck et al., 2001), and the QA4MRE textual question answering tasks in the CLEF evaluations (Peñas et al., 2011; Peñas et al., 2013; Clark et al., 2012; Bhaskar et al., 2012).
We consider the Machine Comprehension of Text dataset (MCTest; Richardson et al., 2013), a set of human-authored fictional stories with associated multiple-choice questions. Knowledge bases and web corpora are not useful for this task, and answers are typically expressed just once in each story. While simple baselines presented by Richardson et al. answer over 60% of questions correctly, many of the remaining questions require deeper analysis.
In this paper, we explore the use of dependency syntax, frame semantics, word embeddings, and coreference for improving performance on MCTest. Syntax, frame semantics, and coreference are essential for understanding who did what to whom. Word embeddings address variation in word choice between the stories and questions. Our added features achieve the best results we are aware of on this dataset, outperforming concurrently-published results (Narasimhan and Barzilay, 2015; Sachan et al., 2015).

Model
We use a simple latent-variable classifier trained with a max-margin criterion. Let P denote the passage, q denote the question of interest, and A denote the set of candidate answers for q, where each a ∈ A denotes one candidate answer. We want to learn a function h : (P, q) → A that, given a passage and a question, outputs a legal a ∈ A.
We use a linear model for h that uses a latent variable w to identify the sentence in the passage in which the answer can be found.
Let W denote the set of sentences within the passage, where a particular w ∈ W denotes one sentence.
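Given a feature vector f(P, w, q, a) and a weight vector θ with an entry for each feature, the prediction â for a new P and q is given by:
$$\hat{a} = \arg\max_{a \in A} \max_{w \in W} \theta^\top f(P, w, q, a)$$
Given training triples $\{\langle P_i, q_i, a_i \rangle\}_{i=1}^{n}$, we minimize an ℓ2-regularized max-margin loss function:
$$\min_{\theta} \; \lambda \lVert \theta \rVert^2 + \sum_{i=1}^{n} \Big[ \max_{a \in A} \max_{w' \in W} \big( \theta^\top f(P_i, w', q_i, a) + \Delta(a, a_i) \big) - \max_{w \in W} \theta^\top f(P_i, w, q_i, a_i) \Big]$$
where λ is the weight of the ℓ2 term and Δ(a, a_i) = 1 if a ≠ a_i and 0 otherwise. The latent variable w makes the loss function non-convex.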
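To make the prediction rule and training objective described above concrete, the following is a minimal sketch in Python. It assumes the standard latent max-margin form (cost-augmented best score minus the best score of the gold answer) and takes the cost to be 1 for a wrong answer and 0 for the gold one. The names `toy_features`, `PASSAGE`, `QUESTION`, `ANSWERS`, and `THETA` are hypothetical illustrations and are not the features or weights used in our system.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: feature_fn(passage, sentence, question, answer) -> features
FeatureFn = Callable[[List[str], str, str, str], List[float]]


def dot(theta: Sequence[float], feats: Sequence[float]) -> float:
    return sum(t * f for t, f in zip(theta, feats))


def best_sentence_score(theta, feature_fn: FeatureFn, passage, question, answer) -> float:
    """Score of an answer: max over latent sentences w of theta . f(P, w, q, a)."""
    return max(dot(theta, feature_fn(passage, w, question, answer)) for w in passage)


def predict(theta, feature_fn: FeatureFn, passage, question, answers) -> str:
    """Return the candidate answer with the highest latent-maximized score."""
    return max(answers,
               key=lambda a: best_sentence_score(theta, feature_fn, passage, question, a))


def example_loss(theta, feature_fn: FeatureFn, passage, question, answers, gold) -> float:
    """Cost-augmented best score minus the gold answer's best score (assumed form)."""
    gold_score = best_sentence_score(theta, feature_fn, passage, question, gold)
    augmented = max(
        best_sentence_score(theta, feature_fn, passage, question, a)
        + (0.0 if a == gold else 1.0)  # assumed cost: 1 for a wrong answer, else 0
        for a in answers
    )
    return augmented - gold_score


def objective(theta, feature_fn: FeatureFn,
              data: List[Tuple[List[str], str, List[str], str]], lam: float) -> float:
    """Squared-norm regularizer weighted by lam plus the summed per-example losses."""
    reg = lam * sum(t * t for t in theta)
    return reg + sum(example_loss(theta, feature_fn, p, q, answers, gold)
                     for p, q, answers, gold in data)


def toy_features(passage, sentence, question, answer) -> List[float]:
    """Hypothetical two-dimensional features: word overlap with answer and question."""
    words = set(sentence.lower().split())
    return [float(len(words & set(answer.lower().split()))),
            float(len(words & set(question.lower().split())))]


PASSAGE = ["James pulled the pudding off the shelf", "He went home"]
QUESTION = "What did James pull"
ANSWERS = ["pudding", "cake"]
THETA = [1.0, 0.5]

print(predict(THETA, toy_features, PASSAGE, QUESTION, ANSWERS))
```

Because the inner maximization over sentences appears with a negative sign for the gold answer, the objective is not convex in θ, which is why a general-purpose quasi-Newton optimizer only finds a local optimum here.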

Features
We start with two features from Richardson et al. (2013). Our first feature corresponds to their sliding window similarity baseline, which measures weighted word overlap between the bag of words constructed from the question/answer and the bag of words in the window. We call this feature B. The second feature corresponds to their word distance baseline, and is the minimal distance between two word occurrences in the passage that are also contained in the question/answer pair. We call this feature D. Space does not permit a detailed description.

Frame Semantic Features
Frame semantic parsing (Das et al., 2014) is the problem of extracting frame-specific predicate-argument structures from sentences, where the frames come from an inventory such as FrameNet (Baker et al., 1998). This task can be decomposed into three subproblems: target identification, in which frame-evoking predicates are marked; frame label identification, in which the evoked frame is selected for each predicate; and argument identification, in which arguments to each frame are identified and labeled with a role from the frame.
An example output of the SEMAFOR frame semantic parser (Das et al., 2014) is given in Figure 1. Three frames are identified. The target words pulled, all, and shelves have respective frame labels CAUSE_MOTION, QUANTITY, and NATURAL_FEATURES. Each frame has its own set of arguments; e.g., the CAUSE_MOTION frame has the labeled Agent, Theme, and Goal arguments. Features from these parses have been shown to be useful for NLP tasks such as slot filling in spoken dialogue systems.
We expect that the passage sentence containing the answer will overlap with the question and correct answer in terms of predicates, frames evoked, and predicted argument labels, and we design features to capture this intuition. Given the frame semantic parse for a sentence, let T be the bag of frame-evoking target words/phrases.¹ We define the bag of frame labels in the parse as F. For each target t ∈ T, there is an associated frame label denoted F_t ∈ F. Let R be the bag of phrases assigned an argument label in the parse. We denote the bag of argument labels in the parse by L. For each phrase r ∈ R, there is an argument label denoted L_r ∈ L. We define a frame semantic parse as a tuple ⟨T, F, R, L⟩. We define six features based on two parsed sentences ⟨T_1, F_1, R_1, L_1⟩ and ⟨T_2, F_2, R_2, L_2⟩, including:
• f_1: # frame label matches: |{⟨s, t⟩ : s ∈ F_1, t ∈ F_2, s = t}|
• f_2: # argument label matches: |{⟨s, t⟩ : s ∈ L_1, t ∈ L_2, s = t}|
• f_5: # target matches, using frame labels
We use two versions of each of these six features: one version for the passage sentence w and the question q, and an additional version for w and the candidate answer a.
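As an illustration of the label-matching counts f_1 and f_2, here is a minimal Python sketch. It assumes that bags are multisets and that every pair of equal occurrences counts separately; the helper name `count_matching_pairs` and the example bags are hypothetical and do not come from an actual SEMAFOR parse.

```python
from collections import Counter
from typing import Iterable


def count_matching_pairs(bag1: Iterable[str], bag2: Iterable[str]) -> int:
    """Count pairs (s, t) with s from bag1, t from bag2, and s == t.

    Assumption: replicated items pair up occurrence by occurrence, so an
    item appearing m times in bag1 and n times in bag2 contributes m * n.
    """
    c1, c2 = Counter(bag1), Counter(bag2)
    return sum(n * c2[item] for item, n in c1.items())


# Hypothetical frame labels and argument labels for a passage sentence
# and for a question; real values would come from a frame semantic parser.
sentence_frames = ["CAUSE_MOTION", "QUANTITY", "NATURAL_FEATURES"]
question_frames = ["CAUSE_MOTION"]
sentence_args = ["Agent", "Theme", "Goal"]
question_args = ["Agent", "Theme"]

f1 = count_matching_pairs(sentence_frames, question_frames)  # frame label matches
f2 = count_matching_pairs(sentence_args, question_args)      # argument label matches
print(f1, f2)
```

The same helper can be applied once to the sentence/question pair and once to the sentence/answer pair to obtain the two versions of each feature.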

Syntactic Features
If two sentences refer to the same event, then it is likely that they have some overlapping dependencies. To compare a Q/A pair to a sentence in the passage, we first use rules to transform the question into a statement and insert the candidate answer into the trace position. Our simple rule set is inspired by the rich history of QA research into modeling syntactic transformations between questions and answers (Moschitti et al., 2007; Wang et al., 2007; Heilman and Smith, 2010). Given the Stanford dependency tree and part-of-speech (POS) tags for the question, let arc(u, v) be the label of the dependency between child word u and head word v, let POS(u) be the POS tag of u, let c be the wh-word in the question, let r be the root word in the question's dependency tree, and let a be the candidate answer. We use the following rules:²
• c = what, POS(r) = VB, and arc(c, r) = dobj. Insert a after word u where arc(u, r) = nsubj. Delete c and the word after c.
• c = where, POS(r) = VB, and arc(c, r) = advmod. Delete c and the word after c. If r has a child u such that arc(u, r) = dobj, insert a after u; else, insert a after r and delete r.
• c = where, r = is, POS(r) = VBZ, and arc(c, r) = advmod. Delete c. Find r's child u such that arc(u, r) = nsubj, move r to be right after u. Insert a after r.
¹ By bag, we mean here a set with possible replicates.
We use other rules in addition to those above: change "why x?" to "the reason x is a", and change "how many x", "how much x", or "when x" to "x a".
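The string-level rules for "why", "how many", "how much", and "when" questions can be sketched as below. This is only an illustration under simplifying assumptions: it matches the question prefix case-insensitively, strips a trailing question mark, and applies the rewrite literally, so the output is meant for overlap matching rather than being fluent English. The function name `transform_simple` is hypothetical, and the dependency-based rules for "what" and "where" are not covered.

```python
from typing import Optional


def transform_simple(question: str, answer: str) -> Optional[str]:
    """Rewrite a question plus candidate answer into a statement.

    Implements only the prefix rules: "why x?" -> "the reason x is a" and
    "how many x" / "how much x" / "when x" -> "x a".
    Returns None when no prefix rule applies.
    """
    q = question.strip().rstrip("?").strip()
    lower = q.lower()
    if lower.startswith("why "):
        return f"the reason {q[len('why '):]} is {answer}"
    for prefix in ("how many ", "how much ", "when "):
        if lower.startswith(prefix):
            return f"{q[len(prefix):]} {answer}"
    return None


print(transform_simple("Why did Todd go home?", "he was tired"))
print(transform_simple("How many apples did she eat?", "three"))
print(transform_simple("What did James pull off?", "pudding"))
```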
Given each candidate answer, we attempt to transform the question into a statement using the rules above.³ An example of the transformation is given in Figure 2. In the parse, pull is the root word and What is attached as a dobj. This matches the first rule, so we delete did and insert the candidate answer pudding after pull, making the final transformed sentence: James pull pudding off.
After this transformation of the question (and a candidate answer) to a statement, we measure its similarity to the sentence in the window using simple dependency-based similarity features. Denoting a dependency as (u, v, arc(u, v)), two dependencies (u_1, v_1, arc(u_1, v_1)) and (u_2, v_2, arc(u_2, v_2)) match if and only if u_1 = u_2, v_1 = v_2, and arc(u_1, v_1) = arc(u_2, v_2). One feature simply counts the number of dependency matches between the transformed question and the passage sentence. We include three additional count features that each consider a subset of dependencies from the following three categories: (1) v = r and u = a; (2) v = r but u ≠ a; and (3) v ≠ r.
In Figure 2, the triples (James, pull, nsubj) and (off, pull, prt) belong to the second category while (pudding, pull, dobj) belongs to the first.
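The four dependency-count features can be sketched as follows. This is a minimal illustration that assumes lemmatized, single-token words and exact triple matching, with each dependency of the transformed question counted at most once; the function name and the passage-sentence triples are hypothetical, while the transformed-question triples follow the Figure 2 example.

```python
from typing import Iterable, List, Tuple

Dep = Tuple[str, str, str]  # (child u, head v, label arc(u, v))


def dependency_match_features(statement_deps: Iterable[Dep],
                              sentence_deps: Iterable[Dep],
                              root: str, answer: str) -> List[int]:
    """Return [all matches, category 1, category 2, category 3].

    Category 1: head is the root and child is the answer.
    Category 2: head is the root and child is not the answer.
    Category 3: head is not the root.
    """
    sentence_set = set(sentence_deps)
    total = cat1 = cat2 = cat3 = 0
    for child, head, label in statement_deps:
        if (child, head, label) not in sentence_set:
            continue  # no identical triple in the passage sentence
        total += 1
        if head == root and child == answer:
            cat1 += 1
        elif head == root:
            cat2 += 1
        else:
            cat3 += 1
    return [total, cat1, cat2, cat3]


# Transformed question "James pull pudding off" (triples as in Figure 2).
statement = [("James", "pull", "nsubj"), ("pudding", "pull", "dobj"), ("off", "pull", "prt")]
# Hypothetical lemmatized passage-sentence dependencies.
sentence = [("James", "pull", "nsubj"), ("pudding", "pull", "dobj"),
            ("off", "pull", "prt"), ("the", "pudding", "det")]

print(dependency_match_features(statement, sentence, root="pull", answer="pudding"))
```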

Word Embeddings
Word embeddings (Mikolov et al., 2013) represent each word as a low-dimensional vector where the similarity of vectors captures some aspect of semantic similarity of words. They have been used for many tasks, including semantic role labeling (Collobert et al., 2011), named entity recognition (Turian et al., 2010), parsing (Bansal et al., 2014), and the Facebook QA tasks (Sukhbaatar et al., 2015). We first define the vector f^+_w as the sum of the vectors of all words in sentence w and f^×_w as the elementwise product of the vectors of the words in w. To define vectors for answer a for question q, we concatenate q and a, then calculate f^+_qa and f^×_qa. For the bag-of-words feature B, instead of merely counting matches of the two bags of words, we also use cos(f^+_qa, f^+_w) and cos(f^×_qa, f^×_w) as features, where cos is cosine similarity. For syntactic features, where τ_w is the bag of dependencies of w and τ_qa is the bag of dependencies for the transformed question for candidate answer a, we use a feature function that returns the following:

where is short for arc(u, v). 4
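The additive and multiplicative composition features for the bag-of-words case can be sketched as follows. The tiny three-dimensional vectors in `TOY_EMB` are made up for illustration (the actual system uses pretrained 300-dimensional vectors), and skipping out-of-vocabulary words and returning zeros for empty inputs are assumptions of this sketch.

```python
import math
from typing import Dict, List, Sequence


def vec_sum(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Elementwise sum of word vectors (additive composition)."""
    return [sum(dims) for dims in zip(*vectors)]


def vec_prod(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Elementwise product of word vectors (multiplicative composition)."""
    return [math.prod(dims) for dims in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # assumption: treat zero vectors as having no similarity
    return sum(x * y for x, y in zip(a, b)) / norm


def embedding_features(sentence_words: Sequence[str], qa_words: Sequence[str],
                       emb: Dict[str, List[float]]) -> List[float]:
    """Return [cos of summed vectors, cos of multiplied vectors]."""
    w_vecs = [emb[t] for t in sentence_words if t in emb]  # skip unknown words
    qa_vecs = [emb[t] for t in qa_words if t in emb]
    if not w_vecs or not qa_vecs:
        return [0.0, 0.0]
    return [cosine(vec_sum(qa_vecs), vec_sum(w_vecs)),
            cosine(vec_prod(qa_vecs), vec_prod(w_vecs))]


# Made-up embeddings for illustration only.
TOY_EMB = {
    "james": [0.9, 0.1, 0.3],
    "pull": [0.2, 0.8, 0.4],
    "pulled": [0.25, 0.75, 0.4],
    "pudding": [0.1, 0.3, 0.9],
    "cake": [0.2, 0.4, 0.8],
}

print(embedding_features(["james", "pulled", "pudding"],
                         ["james", "pull", "pudding"], TOY_EMB))
```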

Coreference Resolution
Coreference resolution systems aim to identify chains of mentions (within and across sentences) that refer to the same entity. We integrate coreference information into the bag-of-words, frame semantic, and syntactic features. We run a coreference resolution system on each passage, then for these three sets of features, we replace exact string match with a check for membership in the same coreference chain.
When using features augmented by word embeddings or coreference, we create new versions of the features that use the new information, concatenating them with the original features.
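Replacing exact string match with a coreference-chain check can be sketched as below. This simplified illustration indexes chains by lowercased mention strings, whereas a real coreference system identifies mentions by position in the text; it also assumes that identical strings still count as a match. All names and the example chain are hypothetical.

```python
from typing import Dict, List, Sequence


def build_chain_index(chains: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map each (lowercased) mention string to the id of its coreference chain."""
    index: Dict[str, int] = {}
    for chain_id, chain in enumerate(chains):
        for mention in chain:
            index[mention.lower()] = chain_id
    return index


def coref_match(a: str, b: str, index: Dict[str, int]) -> bool:
    """True if a and b are identical or belong to the same coreference chain."""
    a, b = a.lower(), b.lower()
    if a == b:
        return True  # assumption: identical strings still count as a match
    return a in index and index[a] == index.get(b)


def coref_overlap(qa_words: Sequence[str], sentence_words: Sequence[str],
                  index: Dict[str, int]) -> int:
    """Count question/answer words that match some sentence word under coref_match."""
    return sum(1 for q in qa_words
               if any(coref_match(q, s, index) for s in sentence_words))


# Hypothetical chain as a coreference system might produce for a story about Todd.
index = build_chain_index([["Todd", "he", "him", "his"]])
qa_words = ["Todd", "say", "home"]
sentence_words = ["he", "said", "home"]

exact = sum(1 for q in qa_words if q.lower() in {s.lower() for s in sentence_words})
print(exact, coref_overlap(qa_words, sentence_words, index))
```

In this toy case the pronoun "he" is linked to "Todd", so the coreference-aware count is higher than the exact-match count, which is the effect the augmented features are intended to capture.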

Experiments
MCTest splits its stories into train, development, and test sets. Because the original MCTest development set is too small to choose the best feature set reliably, we merged the train and development sets of MC160 and MC500 and split them randomly into a 250-story training set (TRAIN) and a 200-story development set (DEV). We optimize the max-margin training criterion on TRAIN and use DEV to tune the regularizer λ and choose the best feature set. We report final performance on the original two test sets from MCTest (for comparability), named MC160 and MC500.
We use SEMAFOR (Das et al., 2010; Das et al., 2014) for frame semantic parsing and the latest Stanford dependency parser (Chen and Manning, 2014) as our dependency parser. We use the Stanford rule-based system for coreference resolution (Lee et al., 2013). We use the pretrained 300-dimensional word embeddings downloadable from the word2vec site.⁵ We denote the frame semantic features by F and the syntactic features by S. We use superscripts w and c to indicate the use of embeddings and coreference for a particular feature set. To minimize the loss, we use the minFunc package in MATLAB with L-BFGS (Nocedal, 1980; Liu and Nocedal, 1989).
The accuracy of different feature sets on DEV is given in Table 1.⁶ The boldface results correspond to the best feature set combination chosen by evaluating on DEV. In this case, the feature dimensionality is 29, which includes 4 bag-of-words features, 1 distance feature, and 12 frame semantic features, with the remainder being syntactic features. After choosing the best feature set on DEV, we then evaluate our system on TEST.
Negations: In preliminary experiments, we found that our system struggled with negation questions, so we developed a simple heuristic to deal with them. We identify a question as a negation question if it contains "not" or "n't" and does not begin with "how" or "why". If a question is identified as a negation question, we then negate the final score for each candidate answer.
The final test results are shown in Table 2. We first compare to results from prior work (Richardson et al., 2013). Their first result uses a sliding window with the bag-of-words feature B described in Sec. 3; this system is called "Baseline 1" (B1). They then add the distance feature D, also described in Sec. 3. The combined system, which uses B and D, is called "Baseline 2" (B2). Their third result adds a rich textual entailment system to B2; it is referred to as B2+RTE.⁷ We also compare to concurrently-published results (Narasimhan and Barzilay, 2015; Sachan et al., 2015).
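The negation heuristic described earlier in this section can be sketched as follows. The tokenization is an assumption of this sketch: "not" is matched as a whole token (so that words like "notice" do not trigger it), while "n't" is matched as a token suffix; the function names are hypothetical.

```python
import re
from typing import List, Sequence


def is_negation_question(question: str) -> bool:
    """Heuristic: contains "not" or "n't" and does not begin with "how" or "why"."""
    q = question.strip().lower()
    if q.startswith(("how", "why")):
        return False
    tokens = re.findall(r"[a-z']+", q)  # assumed tokenization, ASCII apostrophes only
    return any(tok == "not" or tok.endswith("n't") for tok in tokens)


def apply_negation(question: str, scores: Sequence[float]) -> List[float]:
    """Negate every candidate's final score when the question is a negation."""
    if is_negation_question(question):
        return [-s for s in scores]
    return list(scores)


print(is_negation_question("What did Todd not eat?"))
print(is_negation_question("Why didn't he go?"))
print(apply_negation("What did Todd not eat?", [1.0, -2.0, 0.5]))
```

Negating the scores flips the ranking, so the candidate that is least supported by the passage becomes the prediction for a negation question.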
We report accuracies for all questions as well as separately for the two types: those that are answerable with a single sentence from the passage ("Single") and those that require multiple sentences ("Multiple"). We see gains in accuracy of 6% absolute compared to the B2+RTE baseline and also outperform concurrently-published results (Narasimhan and Barzilay, 2015; Sachan et al., 2015). Even though our system only explicitly uses a single sentence from the passage when choosing an answer, we improve baseline accuracy for both single-sentence and multiple-sentence questions.⁸
⁶ ... score for all four candidate answers, then we get partial credit of 0.25 for this question.
⁷ These three results are obtained from files at http://research.microsoft.com/en-us/um/redmond/projects/mctest/results.html.
⁸ However, we inspected these question annotations and occasionally found them to be noisy, which may cloud these comparisons.
We also measure the contribution of each feature set by deleting it from the full feature set. These ablation results are shown in Table 3. We find that frame semantic and syntax features contribute almost equally, and using word embeddings contributes slightly more than coreference information. If we delete the bag-of-words and distance features, then accuracy drops significantly, which suggests that in MCTest, simple surface-level similarity features suffice to answer a large portion of questions.

Analysis
Successes: To show the effects of different features, we show cases where the full system gives the correct prediction (marked with *) but ablating the named features causes the incorrect answer (marked with †) to be predicted:
When his mom asked him about his trip to the city Todd said, "There's no place like home."
q: What did Todd say when he got home from the city? †B) There were so many people in cars; *C) There's no place like home;
Errors: To give insight into our system's performance and reveal future research directions, we also analyzed the errors made by our system. We found that many required inferential reasoning, counting, set enumeration, multiple sentences, time manipulation, and comparisons. Some randomly sampled examples are given below, with the correct answer starred (*):
Ex. 1: requires inference across multiple sentences:
One day Fritz got a splinter in his foot. Stephen did not believe him. Fritz showed him the picture. Then Stephen believed him.
q: What made Stephen believe Fritz? *A) the picture of the splinter in his foot; †C) the picture of the cereal with milk;
Ex. 2: requires temporal reasoning and world knowledge:
Ashley woke up bright and early on Friday morning. Her birthday was only a day away.
q: What day of the week was Ashley's birthday? *A) Saturday; †C) Friday;
Ex. 3: requires comparative reasoning:
Tommy has an old bicycle now. He is getting too big for it.
q: What's wrong with Tommy's old bicycle? *B) it's too small; †C) it's old;

Conclusion
We proposed several novel features for machine comprehension, including those based on frame semantics, dependency syntax, word embeddings, and coreference resolution. Empirical results demonstrate substantial improvements over several strong baselines, achieving new state-of-the-art results on MCTest. Our error analysis suggests that deeper linguistic analysis and inferential reasoning can yield further improvements on this task.