Syntactic and Lexical Approaches to Reading Comprehension

Among the challenges of teaching reading comprehension in K-12 are identifying the portions of a text that are difficult for a student, comprehending major critical ideas, and understanding context-dependent polysemous words. We present a simple, unsupervised, yet robust and accurate syntactic method for the first objective and a modified hierarchical lexical method for the second. Because these methods pinpoint troublesome sentences rather than overall readability and focus on the concepts central to a reading, we believe they will greatly facilitate efforts to help students improve their reading skills.


Introduction
Teaching reading comprehension and readability research are related but distinct. Readability research generally focuses on ranking the difficulty level of a passage, while reading comprehension education aims more directly at helping students read better.
Although readability metrics offer a good indication of a passage's difficulty level, a more useful approach for teaching comprehension is to pick out the difficult sentences for specific, targeted learning. Vocabulary is an important factor in making a sentence difficult, but it often happens that a sentence is still difficult to understand even when it contains no unknown words or after all of its words have been looked up. The following is an example from a 6th-grade history reading: "Nor have legitimate grounds ever failed a prince who wished to show colorable excuse for the non-fulfillment of his promise." (Niccolo Machiavelli, The Prince, Chapter XVII.)
Even though the main idea was more or less clear, sentences like this were, in general, difficult for 6th graders.
Sufficient background and vocabulary are two prerequisites of reading success, but beyond these two, what textual features make a sentence hard? This is the first question this paper addresses. The second question is how to help students understand all major critical ideas in a reading, because a passage contains, in addition to the main idea, major supporting details that are crucial to comprehension. For example, in Martin Luther King Jr.'s Beyond Vietnam speech, the main idea is to oppose the war in Vietnam, and four major reasons are given. Understanding these four reasons is as integral to comprehending the passage as understanding the main idea. The third question we address is how to help students understand polysemous words in context.
In summary, this paper makes the following contributions:
• A set of simple and accurate statistics that identifies, within a passage, the sentences that are challenging.
• A set of interesting findings about standardized reading tests.
• A modified hierarchical lexical clustering method to find critical concepts in a reading.
• A word2vec application for selecting the in-context meaning of a word.

Previous Work
One focus of previous NLP work on assessing text difficulty is readability ranking. For example, Lexile (Lennon, 2004), Flesch-Kincaid (Kincaid, 1975), Dale-Chall (Dale, 1948), Coleman-Liau (Coleman, 1975), and SMOG (McLaughlin, 1969) rely largely on words and sentence length. Since one or two long sentences or difficult words do not necessarily make a passage difficult, those systems rank an entire passage or book and are not aimed at pinpointing difficult sentences.
More recently, Pitler et al. (2008), Peterson et al. (2009), Kate et al. (2010), Feng (2010), and Dascalu et al. (2013) addressed the readability problem using supervised data and a richer set of linguistic features. However, their systems still focus on giving a readability score for the overall article, not on the individual sentences from which students can improve their reading comprehension. Pitler et al. (2008) and Tanaka-Ishii et al. (2010) also built comparators to decide the relative difficulty of two sentences. Both, and Tanaka-Ishii et al. (2010) especially, make heavy use of lexical features. All these models also require supervised data and vocabulary acquisition.
François et al. (2014), Siddharthan et al. (2014), and Vajjala et al. (2014) have focused on sentence simplification rather than on sentence selection for the purpose of teaching reading comprehension.
This paper provides a simple and robust method for identifying difficult sentences in a reading passage. We incorporate some of the standard features seen in previous work, such as tree depth, but we also devise new features, such as abstract appositives. While much of the previous research has made use of both lexical and syntactic features, our focus is an in-depth study of the syntactic phenomena that contribute to sentence complexity.
In addition to individual sentences that are hard to read, scattered concepts are also challenging to a reader. An author often develops a critical idea over several paragraphs using paraphrases, synonyms, and related ideas. A reader who cannot see the relation among these words and phrases will have difficulty grasping the concept. For this problem, we propose a word2vec-based (Mikolov, 2013) modified hierarchical clustering model to find clusters of concepts in a reading passage.

The Syntactic Features
We present a set of simple and robust features able to identify the difficult sentences in a reading. We show the efficacy of these features in a series of tests on grade-level readings.

The Features
Figures 1a-1f depict each feature in action. In each figure, a rectangular box describes what the feature is and how it is determined.

Feature Performance
Our goal is to find candidate sentences that are challenging for a young reader. This task is difficult to evaluate for two reasons: the lack of labeled data at the sentence level and, probably more importantly, the lack of a methodology for creating such a dataset. The creation of supervised data involves judgment from a young reader (under 16 years of age). First, young children often cannot articulate what they find difficult. Second, they sometimes think they understand a sentence when they do not.
An attempt was made at a local tutoring center for children aged 11-16. Fifty-two children were given a grade-level passage and an above-grade passage (e.g. a hard SAT passage). They were asked to pick out the sentences they did not understand. For both passages, more than 80% of the children either said they understood everything or found the passage hard but could not tell where the difficulties were. They were then given multiple-choice questions. Fewer than 5% of the children who claimed they understood everything scored perfectly on the test. For more than 50% of the mistakes made, more than half the children claimed that the cause was carelessness rather than a failure to understand the passage. This attempt showed that human judgment from a young reader is hard to obtain. It also suggests that approximating difficulty via test performance is problematic.
A possible alternative is to convene expert reading teachers and ask them, based on their field experience, to rank each sentence's difficulty level for each grade. This would require these teachers to have intimate knowledge of how children process sentences. For these reasons, we first evaluate the features by measuring how well they correspond to changes in reading level. We then use the features to rank the difficulty of each sentence and perform a qualitative assessment.
For the first part of the evaluation, we look for data that correlate well with grade levels. Representative grade-level readings are not easy to collect because readers in each grade vary greatly in their reading abilities. We thus use passages from standardized tests. In this section, we present data from passages on the New York State ELA tests, which are annual tests given to students in grades 3 to 8. For high school reading data, we use the SAT, a national test for high school students. Thus, the data represent standard reading levels from grade 3 through high school. We first run the Stanford parser (Manning et al., 2014). We then collect statistics for the nine features on each sentence. The data statistics and feature performance are shown in Tables 2a-2c. For example, the increase in Delay from Grade 5 to Grade 6 is statistically significant at the 95% confidence level (p-value 0.003 < 0.05 in Table 2a). All significant changes are in bold. While the general trend is an increase through the grades, decreases are sometimes observed between two adjacent grades. Many of these decreases are not statistically significant, such as the decrease in Delay from Grade 3 to Grade 4 (p-value 0.13).
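The grade-to-grade comparison above can be illustrated with a short sketch. The paper does not name the significance test it uses, so the two-sided permutation test on the difference of means below is an assumption, and the function name `permutation_p_value` and the per-sentence scores are made up for illustration; they are not data from Tables 2a-2c.

```python
import random


def permutation_p_value(lower_grade, upper_grade, n_resamples=5000, seed=0):
    """Two-sided permutation test on the difference of sample means.

    Assumption: the paper does not state which test produced its p-values;
    this is one common, distribution-free choice.
    """
    rng = random.Random(seed)
    n_lower = len(lower_grade)
    n_upper = len(upper_grade)
    observed = abs(sum(upper_grade) / n_upper - sum(lower_grade) / n_lower)
    pooled = list(lower_grade) + list(upper_grade)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabeling of sentences to grades
        diff = abs(sum(pooled[n_lower:]) / n_upper - sum(pooled[:n_lower]) / n_lower)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-sentence Delay scores (illustrative values only).
grade5_delay = [1, 2, 1, 3, 2, 2, 1, 2, 3, 1]
grade6_delay = [4, 5, 3, 6, 4, 5, 4, 6, 5, 4]

p = permutation_p_value(grade5_delay, grade6_delay)
significant = p < 0.05  # the 95% criterion used in the text
print(f"p-value = {p:.4f}, significant = {significant}")
```

A parametric alternative such as a two-sample t-test would serve the same purpose when the per-sentence scores are roughly normal.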
It is noticeable that standard readings in grades 3-12 contain virtually none of the more specialized features of Figures 1c-1f. These features are more prominent in older and more mature readings, such as 19th-century literature. In Section 5, we use only the features in Figures 1a and 1b.
Next we rank the sentences. Each sentence has a vector of nine feature scores. Although many different weighting schemes are possible, we take the simple approach of uniform weights. We compare the top three most difficult sentences as ranked by the nine features to those ranked by sentence length and tree depth. For lower-grade texts, there is almost no difference in the order, but for more complex passages, more significant differences start to show. Through this exercise, we also find a qualitative value in the nine features. Even when the rankings by our nine features agree with the length-based rankings, we can point out more specifically what makes these sentences difficult. These specifics are shown as Notes in Table 3. We believe the ability to locate these syntactic phenomena for students should be helpful in improving their reading skills.
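The uniform-weight ranking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rank_sentences` is hypothetical, the feature vectors are invented and shortened to three features for brevity, and summing raw scores without normalization is an assumption, since the text does not say whether features are rescaled before they are combined.

```python
def rank_sentences(feature_scores, top_k=3):
    """Return the indices of the top_k sentences by uniformly weighted score.

    feature_scores: one list of numeric feature scores per sentence.
    Uniform weights reduce the combined score to a plain sum (assumption:
    no per-feature normalization is applied).
    """
    totals = [sum(scores) for scores in feature_scores]
    order = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    return order[:top_k]


# Hypothetical feature vectors for four sentences (illustrative values only).
scores = [
    [2, 0, 1],  # sentence 0, total 3
    [5, 2, 3],  # sentence 1, total 10
    [1, 1, 0],  # sentence 2, total 2
    [4, 1, 1],  # sentence 3, total 6
]

top3 = rank_sentences(scores)
print(top3)  # [1, 3, 0]
```

Because the per-feature scores are kept alongside the total, the same data can also explain why a sentence ranks highly, which is the qualitative value noted in the text.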

Table 3. Sentence Ranking Example
Rank: Top 1 by both
Sentence: Deeming that a serene and unconscious contemplation of him would best beseem me, and would be most likely to quell his evil mind, I advanced with that expression of countenance, and was rather congratulating myself on my success, when suddenly the knees of Trabb's boy smote together, his hair uprose, his cap fell off, he trembled violently in every limb, staggered out into the road, and crying to the populace, "Hold me!"
Notes: In addition to a depth of 17 levels, the sentence has two long delays (underlined) and a parallel phrase (double underlined).

Rank: Top 2 by length and depth
Sentence: Words cannot state the amount of aggravation and injury wreaked upon me by Trabb's boy, when, passing abreast of me, he pulled up his shirt collar, twined his side-hair, stuck an arm akimbo, and smirked extravagantly by, wriggling his elbows and body, and drawling to his attendants, "Don't know yah, don't know yah, 'pon my soul don't know yah!"

Rank: Top 2 by nine features
Sentence: The disgrace attendant on his immediately afterwards taking to crowing and pursuing me across the bridge with crows, as from an exceedingly dejected fowl who had known me when I was a blacksmith, culminated the disgrace with which I left the town, and was, so to speak, ejected by it into the open country.
Notes: A long interruption of 18 words (underlined), one parallel phrase ("crowing and pursuing", double underlined), and one PP fronting ("with which", italicized).

Rank: Top 3 by both
Sentence: One or two of the tradespeople even darted out of their shops, and went a little way down the street before me, that they might turn, as if they had forgotten something, and pass me face to face - on which occasions I don't know whether they or I made the worse pretence; they of doing it, or I of not seeing it.
Notes: Specific features are PP fronting (italicized) and one parallel phrase (underlined).

The Lexical Approach
We now turn to finding critical ideas in a reading. Our concern is to find related and paraphrased words that contribute to the same idea.

An Example
We distinguish critical ideas from the main idea of a reading. Critical ideas are any ideas that the author develops to some extent. A crude definition is that a critical idea is an idea that the author mentions more than once. A critical idea may or may not be the main idea, but each should contribute to the main idea. In the following short passage, there is one main idea and several critical ideas.
"Black holes are the most efficient engines of destruction known to humanity. Their intense gravity is a one-way ticket to oblivion, and material spiraling into them can heat up to millions of degrees and glow brightly. Yet, they are not all-powerful. Even supermassive black holes are minuscule by cosmic standards. They typically account for less than one percent of their galaxy's mass. Accordingly, astronomers long assumed that supermassive holes, let alone their smaller cousins, would have little effect beyond their immediate neighborhoods. So it has come as a surprise over the past decade that black hole activity is closely intertwined with star formation occurring farther out in the galaxy." (SAT 2009 Practice Test)
The main idea is stated in the last sentence of the passage, while the critical ideas that the author develops are "black holes", "destruction", and "intertwined with star formation".

Finding Critical Ideas
The word2vec model (Mikolov, 2013) has been a widely used statistical model for encoding word meanings. We use a modified hierarchical clustering algorithm with word2vec vectors as the representation of each word. First, cosine similarity scores are computed for every word pair in the passage (after removing stopwords), resulting in an n × n matrix, where n is the number of words. Unlike traditional hierarchical clustering, where the end result is a tree structure, our clustering is flatter and does not build a hierarchy. There are two linking criteria: (1) the similarity between two words must exceed a minimum threshold, and (2) the similarity between a word and an existing cluster must exceed a minimum percentage of the similarity of the best pair in that cluster. The algorithm is given in Figure 3.
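Since Figure 3 is not reproduced here, the following is only a minimal sketch of a flat clustering procedure consistent with the two linking criteria, not the authors' algorithm. It treats the linking score as cosine similarity (higher means closer), which is an interpretive assumption. The names `flat_clusters`, `min_pair_sim`, and `min_ratio`, the threshold values, the use of average similarity as the word-to-cluster score, and the hand-made three-dimensional vectors standing in for word2vec embeddings are likewise all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flat_clusters(vectors, min_pair_sim=0.8, min_ratio=0.9):
    """Greedy flat clustering; no hierarchy is built and clusters never merge."""
    pairs = sorted(
        ((cosine(vectors[a], vectors[b]), a, b) for a, b in combinations(vectors, 2)),
        reverse=True,
    )
    clusters = []  # each cluster: {"members": [...], "best": best pair score}
    home = {}      # word -> the cluster it belongs to
    for score, a, b in pairs:
        if score < min_pair_sim:
            break  # criterion 1: the pair itself must be similar enough
        if a not in home and b not in home:
            cluster = {"members": [a, b], "best": score}
            clusters.append(cluster)
            home[a] = home[b] = cluster
        elif (a in home) != (b in home):
            inside, outside = (a, b) if a in home else (b, a)
            cluster = home[inside]
            members = cluster["members"]
            avg = sum(cosine(vectors[outside], vectors[m]) for m in members) / len(members)
            # Criterion 2: compare against a fraction of the cluster's best pair.
            if avg >= min_ratio * cluster["best"]:
                members.append(outside)
                home[outside] = cluster
    return [sorted(c["members"]) for c in clusters]


# Toy vectors chosen by hand; real input would be word2vec embeddings.
toy_vectors = {
    "destruction": [1.0, 0.1, 0.0],
    "oblivion": [0.9, 0.2, 0.0],
    "ruin": [0.95, 0.15, 0.05],
    "star": [0.0, 1.0, 0.1],
    "galaxy": [0.1, 0.9, 0.2],
    "ticket": [0.0, 0.0, 1.0],
}

print(flat_clusters(toy_vectors))
```

With these toy vectors the sketch groups "destruction", "oblivion", and "ruin" together, puts "galaxy" and "star" in a second cluster, and leaves "ticket" unclustered because it is not close enough to any other word.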

Applications, Experiments and Results
In addition to identifying troublesome sentences, these features support other useful analyses. Interesting experiments include comparing tests along dimensions such as geography and educational standards.

State Difference?
The National Assessment of Educational Progress, or NAEP, offers reading assessments to 4th and 8th graders nationwide. In 2015, all 52 states and jurisdictions participated. A state may score higher than another for a variety of reasons, economic, political, and so on. In this experiment, we are interested in seeing whether there might be any meaningful correlation between a state's NAEP score and the difficulty level of its state ELA tests. To this end, we select Massachusetts, the top-ranking state, whose NAEP score of 235 is considerably higher than the national average of 221, and compare its state ELA passages to those of New York, whose score is 223. The data comparison is shown in Table 4a. The metrics are shown in Tables 4b and 4c, where significance is assessed at the 95% confidence level and bold values indicate statistical significance. Again, the more specialized feature 'Inversion' is not a significant factor in 4th and 8th grade readings. It is interesting to see that, for both 4th and 8th grades, text difficulty increases from NY's ELA tests to MA's ELA tests. Many factors, both educational and non-educational, influence a state's performance. Perhaps this could be a first step toward better understanding the impact of increased text difficulty on student reading performance.

SAT or ACT?
The SAT and the ACT are standardized tests taken by college-bound juniors and seniors. One section common to both tests is the Reading section, where students are given passages to read and multiple-choice questions to answer. Students and parents have long wondered which test is easier. A simple online search for "SAT reading vs. ACT reading" yields many comparisons. The question of which test is easier depends on many factors, such as timing and question types. This paper is concerned not with a simple yes/no answer to that question but with comparing the passages on each reading test. In a simple survey at a local test preparation center, students who chose the ACT all reported that the ACT passages are more straightforward than those on the SAT, and those who took the SAT reported that some SAT passages are harder to read, specifically in genres such as pre-1900 fiction and history. This does not directly lead to a judgment of which test is easier; it suggests only that the ACT passages are easier to read. To test this hypothesis and to quantify how much the reading passages on the two tests differ in difficulty, we collect passages from both tests and run the feature analysis on them. The data information is presented in Table 5a.

Table 5a (columns: Test, Year of Test)
The results of the analysis are shown in Table 5b. ACT passages score uniformly lower than those on the SAT, with the majority of the differences being statistically significant. Table 5c shows that the standard deviations for the SAT are higher, indicating that the SAT passages vary more. The two excerpts from each test in Table 6 give a qualitative view of the phenomenon, where * indicates an example of increased complexity.
Table 6. SAT and ACT Passage Difference Examples
ACT Humanities: In 2008, the prodigiously gifted bassist, singer, and composer Esperanza Spalding released her major-label debut, Esperanza, which she recorded as a twenty-three-year-old instructor at the Berklee College of Music.
ACT Science: Pikas, a diminutive alpine-dwelling rabbit relative, are unique among alpine mammals in that they gather up vegetation throughout summer - including flowers, grasses, leaves, evergreen needles, and even pine cones - and live off the hay pile throughout winter, rather than hibernating or moving downslope.
* SAT Humanities: But of all relations, that between men and women, being the nearest and most intimate, and connected with the greatest number of strong emotions, was sure to be the last to throw off the old rule, and receive the new; for, in proportion to the strength of a feeling is the tenacity with which it clings to the forms and circumstances with which it has even accidentally become associated …
SAT Science: Nearly a half-century ago, Peter Higgs and a handful of other physicists were trying to understand the origin of a basic physical feature: mass. You can think of mass as an object's heft or, a little more precisely, as the resistance it offers to having its motion changed.

Automatic Vocabulary Response
It is labor-intensive to evaluate the efficacy of the word2vec-based lexical approach manually. While we annotate data for further research, we evaluate the idea in the meantime on vocabulary questions from the eight released official SAT tests (CollegeBoard, 2009). These vocabulary questions ask for the meaning of a word in the context of a given passage. The majority of the choices consist of one word each. Our baseline approach is to measure the vector cosine score between the word in question and the words in each choice. The choice with the greatest similarity score is chosen as the answer. When a choice has more than one word, we first remove the function words and then take the average of the vector scores.
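The baseline can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function name `baseline_answer`, the short function-word list, and the hand-made three-dimensional vectors are all hypothetical stand-ins for a trained word2vec model, and "average of the vector scores" is read here as the mean of the per-word cosine scores.

```python
import math

# Small illustrative function-word list (assumption; the paper's list is not given).
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "and", "or"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def baseline_answer(target, choices, vectors):
    """Pick the choice whose words are, on average, closest to the target word."""
    target_vec = vectors[target]

    def score(choice):
        words = [w for w in choice.lower().split()
                 if w not in FUNCTION_WORDS and w in vectors]
        if not words:
            return float("-inf")  # nothing to compare against
        return sum(cosine(target_vec, vectors[w]) for w in words) / len(words)

    return max(choices, key=score)


# Toy vectors chosen by hand; they are not real word2vec output.
toy_vectors = {
    "consumption": [0.9, 0.4, 0.1],
    "purchasing": [1.0, 0.1, 0.0],
    "viewing": [0.1, 1.0, 0.0],
    "destruction": [0.2, 0.0, 1.0],
    "erosion": [0.1, 0.1, 1.0],
    "obsession": [0.3, 0.3, 0.6],
}
choices = ["destruction", "viewing", "erosion", "purchasing", "obsession"]

answer = baseline_answer("consumption", choices, toy_vectors)
print(answer)  # purchasing
```

With these toy vectors the context-free score favors "purchasing", mirroring the kind of error the baseline is reported to make on polysemous words.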
We then apply a contextual word2vec model to the questions. For each word in a vocabulary question, we locate the sentence in which the word occurs and add up the vectors of all the content words in that sentence. The resultant vector is then compared to each choice in the vocabulary question. Table 7 shows that the context model outperforms the baseline significantly. This experiment shows the power of combining context with a computable meaning representation such as word2vec.
Table 7. Word2Vec-based Vocabulary Performance
One reason the baseline performs poorly is that almost all words tested in the SAT vocabulary questions are polysemous. The word2vec model is trained mostly on news data, which biases the meaning of a word toward a typical news-oriented meaning. For example, the word 'consumption', without context, is most intuitively associated with consumers and commerce. In this question, of the five choices, "destruction", "viewing", "erosion", "purchasing", and "obsession", the most likely context-independent choice is "purchasing", and that is what the baseline model chooses. In the given passage, however, the enclosing sentence is "According to [this thesis], television consumption leads above all to moral dangers." After the vectors of all the contextual words are added up, the correct answer "viewing" surfaces, and the context model answers the question correctly. This model makes concrete what English teachers mean when they instruct students to look at the context. It also nicely represents the idea that the meaning of a word is selected by its surrounding words (the context).
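The context model can be sketched as a sum of content-word vectors compared against each choice. The code below is an illustration only: the function name `context_answer`, the tokenizer, the function-word list, and the hand-made three-dimensional vectors are assumptions, and the vectors were chosen by hand so that the example reproduces the behavior described for 'consumption'; they are not real word2vec output.

```python
import math
import re

# Small illustrative function-word list (assumption; the paper's list is not given).
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "and", "or", "this", "all"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_answer(sentence, choices, vectors):
    """Sum the vectors of the sentence's content words, then pick the closest choice."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    content = [t for t in tokens if t not in FUNCTION_WORDS and t in vectors]
    dims = len(next(iter(vectors.values())))
    context_vec = [sum(vectors[t][d] for t in content) for d in range(dims)]
    return max(choices, key=lambda c: cosine(context_vec, vectors[c]))


# Toy vectors chosen by hand for illustration only.
toy_vectors = {
    "consumption": [0.9, 0.4, 0.1],
    "television": [0.0, 1.0, 0.0],
    "thesis": [0.1, 0.2, 0.0],
    "leads": [0.1, 0.1, 0.1],
    "moral": [0.0, 0.2, 0.2],
    "dangers": [0.0, 0.1, 0.4],
    "purchasing": [1.0, 0.1, 0.0],
    "viewing": [0.1, 1.0, 0.0],
    "destruction": [0.2, 0.0, 1.0],
    "erosion": [0.1, 0.1, 1.0],
    "obsession": [0.3, 0.3, 0.6],
}
choices = ["destruction", "viewing", "erosion", "purchasing", "obsession"]
sentence = "According to this thesis, television consumption leads above all to moral dangers."

answer = context_answer(sentence, choices, toy_vectors)
print(answer)  # viewing
```

In this toy setup the vector for "television" pulls the summed context vector toward "viewing", which is the effect the passage attributes to surrounding words.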

Conclusion and Future Work
We present a set of straightforward and novel features to identify difficult sentences in a reading passage. In our experiments, the features correlate well with the actual grade of each text. We are also able to quantify and make more concrete the differences between Common Core and pre-Common Core standards, and between different states. In the future, we hope not only to put all of this into an application for real use but also to incorporate general-purpose lexical features to further enhance reading comprehension education. Secondly, we intend to continue investigating word2vec as a stepping stone to distributed meaning representation. For example, we plan to extend critical ideas to multi-word phrases and to tackle reading comprehension questions such as those on the SAT.