Assessing Relative Sentence Complexity using an Incremental CCG Parser



Introduction
The task of assessing text readability aims to classify text into different levels of difficulty, e.g., text comprehensible by a particular age group or by second language learners (Petersen and Ostendorf, 2009; Feng, 2010; Vajjala and Meurers, 2014). There have been efforts to automatically simplify Wikipedia to make its content accessible to children and English language learners (Zhu et al., 2010; Woodsend and Lapata, 2011; Coster and Kauchak, 2011; Wubben et al., 2012; Siddharthan and Mandya, 2014). In a related effort, Vajjala and Meurers (2016) studied the use of linguistic features for automatically classifying a pair of sentences (one from Standard Wikipedia and the other its corresponding simplification from Simple Wikipedia) into COMPLEX and SIMPLE. As syntactic features, they used information from phrase structure trees produced by a non-incremental parser, and found it useful.
However, psycholinguistic theories suggest that humans process text incrementally, i.e., humans build a syntactic analysis interactively by enhancing the current analysis or choosing an alternative analysis on the basis of its plausibility with respect to context (Marslen-Wilson, 1973; Altmann and Steedman, 1988; Tanenhaus et al., 1995). Besides being cognitively possible, incremental parsing has been shown to be useful for many real-time applications such as language modeling for speech recognition (Chelba and Jelinek, 2000; Roark, 2001), modeling text reading time (Demberg and Keller, 2008), dialogue systems (Stoness et al., 2004) and machine translation (Schwartz et al., 2011). Furthermore, incremental parsers run in linear time. Here we explore the usefulness of incremental parsing for predicting relative sentence readability.
Given a pair of sentences, one a simplified version of the other, we aim to classify the sentences into SIMPLE or COMPLEX. We use sentences from Standard Wikipedia (WIKI) paired with their corresponding simplifications in Simple Wikipedia (SIMPLEWIKI) as training and evaluation data. We pose this problem as a pairwise classification problem (Section 2). For feature extraction, we use an incremental CCG parser which provides a trace of each step of the parse derivation (Section 3). Our evaluation results show that incremental parse features are more useful than non-incremental parse features (Section 5). With the addition of psycholinguistic features, we attain the best reported results on this task. We make our system available for public use.

Problem Formulation
Initially, Vajjala and Meurers (2014) trained a binary classifier to assign sentences in SIMPLEWIKI to the class SIMPLE and sentences in WIKI to the class COMPLEX. This model performed poorly on relative readability assessment. Noting that not every SIMPLEWIKI sentence is simpler than every sentence in WIKI, Vajjala and Meurers (2016) reframed the task as a ranking problem: given a pair of parallel SIMPLEWIKI and WIKI sentences, the former must be ranked better than the latter in terms of readability. Inspired by Vajjala and Meurers (2016), we also treat each pair together, and model relative readability assessment as a pairwise classification problem. Let $a$, $b$ be a pair of parallel sentences, and let $\mathbf{a}$, $\mathbf{b}$ represent their corresponding feature vectors. We define our classifier $\Phi$ over the difference of the two feature vectors, $\mathbf{a} - \mathbf{b}$, so that it predicts which sentence of the pair is simpler.
The motivation for our modelling is that relative features (differences) are more useful than absolute features, e.g., intuitively, shorter sentences are simpler to read, but length can only be judged in comparison with another sentence.
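The pairwise formulation can be illustrated with a minimal sketch of how difference-based training examples might be built. The function names, the +1/-1 label encoding, and the choice to emit both orderings of each pair are assumptions made for illustration; they are not taken from the system described here.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def difference(a: Vector, b: Vector) -> List[float]:
    """Element-wise difference a - b of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return [x - y for x, y in zip(a, b)]


def pairwise_examples(
    pairs: Sequence[Tuple[Vector, Vector]]
) -> Tuple[List[List[float]], List[int]]:
    """Turn (simple, complex) feature-vector pairs into training examples.

    Assumption: each pair yields two examples, (simple - complex) labelled +1
    and (complex - simple) labelled -1, so that both classes are balanced.
    """
    X: List[List[float]] = []
    y: List[int] = []
    for simple_vec, complex_vec in pairs:
        X.append(difference(simple_vec, complex_vec))
        y.append(+1)  # first sentence of the ordered pair is the simpler one
        X.append(difference(complex_vec, simple_vec))
        y.append(-1)  # first sentence of the ordered pair is the more complex one
    return X, y


# Toy pair with two hypothetical features:
# [sentence length, average category complexity]
X, y = pairwise_examples([([8.0, 1.5], [21.0, 2.25])])
```

Any binary classifier can then be trained on `X` and `y`; because only differences are seen, a feature such as sentence length contributes through its relative value within a pair rather than through its absolute value.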

Incremental CCG Parse Features
Below we provide necessary background, and then present the features.

Combinatory Categorial Grammar (CCG)
CCG is a lexicalized formalism in which words are assigned syntactic types encoding subcategorization information. Figure 1 displays an incremental CCG derivation. Here, the syntactic type (category) (S\NP)/NP on ate indicates that it is a transitive verb looking for an NP (object) on the right-hand side and an NP (subject) on the left-hand side. Due to its lexicalized and strongly typed nature, the formalism offers attractive properties such as elegant composition mechanisms which impose context-sensitive constraints, efficient parsing algorithms, and a synchronous syntax-semantics interface. In Figure 1, the category of with, (NP\NP)/NP, combines with the category of mushrooms, NP, on its right-hand side using the combinatory rule of forward application (indicated by >) to form the category NP\NP representing the phrase with mushrooms. This phrase in turn combines with other contextual categories using CCG combinators to form new categories representing larger phrases.
Figure 1: Incremental CCG derivation tree for the sentence John ate salad with mushrooms.
In contrast to phrase structure trees, CCG derivation trees encode a richer notion of syntactic type and constituency. For example, in a phrase structure tree, the category (constituency tag) of ate would be VBD irrespective of whether it is transitive or intransitive, whereas the CCG category distinguishes these types. As linguistic complexity increases, the complexity of the CCG category may increase, e.g., the relative pronoun has the category (NP\NP)/(S\NP) in relative clause constructions. In addition, CCG derivation trees have combinators annotated at each level which indicate the way in which the category is derived, e.g., in Figure 1 the category S/NP of John ate is formed by first type-raising John (indicated by >T) and then applying forward composition (indicated by >B) with ate. CCG combinators can shed light on the linguistic complexity of a construction, e.g., crossed composition is an indicator of a long-range dependency. Phrase structure trees do not have this additional information encoded on their nodes.
Ambati et al. (2015) introduced a shift-reduce incremental CCG parser for English. The main difference between this incremental version and standard non-incremental CCG parsers such as Zhang and Clark (2011) is that as soon as the grammar allows two types to combine, they are greedily combined. For example, in Figure 1, John is first pushed on the stack but is immediately reduced when its head ate appears on the stack (i.e., John's category combines with ate's category to form a new category), and similarly when salad is seen, it is reduced with ate. When with appears, it waits to be reduced until its head mushrooms appears on the stack, and later mushrooms is reduced with salad via ate using a special revealing operation (indicated by R>) followed by a sequence of operations. The revealing operation regenerates a head that a category has greedily consumed in advance of a subsequently encountered post-modifier. In the non-incremental version, salad is not reduced with ate until with mushrooms is reduced with it.
Consider the following sentences (A) and (B), where (B) is a simpler version of (A). [...] which is more complex than the category of to in (B), which is PP/NP. Both derivations have one right reveal action (indicated by R>). In (A), the depth of this action is two since it is a VP coordination, whereas in (B) the depth is only one. Such information can be useful in predicting the complexity of a sentence.

Features
As discussed above, as the complexity of a sentence increases, the complexity of CCG categories and combinators and the number of revealing operations increase in the incremental analysis. We exploit this information to assess the readability of a sentence. For each sentence, we build a feature vector using the features defined below, extracted from its incremental CCG derivation.
CCG Combinators. We use the number of forward applications, backward applications, compositions, forward compositions, backward compositions, left punctuations, right punctuations, coordinations, type-raisings, type-changings, left revealing and right revealing operations used in the CCG derivation. Each combinator is treated as a different feature dimension with its count as the feature value. For the revealing operations, we also add features which indicate the depth of the revealing, which is analogous to surprisal (Hale, 2001).
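The combinator features described above can be sketched as a simple count over a derivation trace. The trace format (a list of combinator symbol and optional reveal depth), the symbols other than >, >B, >T and R>, and the use of the maximum depth as the depth feature are all assumptions made for illustration; the parser's actual output format and the exact depth features are not specified here.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical symbol inventory. Only ">", ">B", ">T" and "R>" appear in the
# text; the remaining labels are placeholders for the other combinators.
COMBINATORS = [">", "<", ">B", "<B", ">T", "LP", "RP", "CONJ", "TC", "L>", "R>"]
REVEAL_OPS = ("L>", "R>")

# One derivation step: (combinator symbol, reveal depth or None).
Step = Tuple[str, Optional[int]]


def combinator_features(trace: List[Step]) -> Dict[str, float]:
    """Count each combinator in a derivation trace, one dimension per symbol.

    For the revealing operations an extra dimension records the maximum
    reveal depth seen (an assumed way of encoding depth), or 0 if none.
    """
    counts = Counter(symbol for symbol, _ in trace)
    features = {f"count[{c}]": float(counts.get(c, 0)) for c in COMBINATORS}
    for op in REVEAL_OPS:
        depths = [d for symbol, d in trace if symbol == op and d is not None]
        features[f"max_depth[{op}]"] = float(max(depths, default=0))
    return features


# Toy trace loosely modelled on the derivation in Figure 1: type-raising,
# forward composition, two forward applications and one right reveal.
trace = [(">T", None), (">B", None), (">", None), (">", None), ("R>", 1)]
feats = combinator_features(trace)
```

Keeping a fixed symbol inventory means every sentence maps to a vector of the same length, which is what the pairwise difference between two sentences requires.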
CCG Categories. We define the complexity of a CCG category as the number of basic syntactic types used in the category, e.g., the complexity of (S[pss]\NP)/(S[to]\NP) is 4 since it has one S[pss], one S[to], and two NPs. Note that the CCG type S[pss] denotes a sentence of the subtype passive. We use the average complexity of all the CCG categories used in the derivation as a real-valued feature. In addition, we define integer-valued features representing the frequency of specific subtypes (we have 21 subtypes, each defined as a different dimension) and the frequency of the top 8 syntactic types (each as a different dimension).
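The category complexity measure can be sketched in a few lines. The regular expression below is an assumption about how basic types are written (a run of letters optionally followed by a bracketed lowercase subtype); it is an illustrative sketch rather than the extraction code used in the experiments.

```python
import re
from typing import Iterable

# A basic (atomic) type is a run of letters, optionally followed by a
# bracketed subtype such as [pss] or [to]. Slashes and parentheses are
# structural and are not counted.
BASIC_TYPE = re.compile(r"[A-Za-z]+(?:\[[a-z]+\])?")


def category_complexity(category: str) -> int:
    """Number of basic syntactic types occurring in a CCG category string."""
    return len(BASIC_TYPE.findall(category))


def average_complexity(categories: Iterable[str]) -> float:
    """Mean complexity over the categories of a derivation (0.0 if empty)."""
    values = [category_complexity(c) for c in categories]
    return sum(values) / len(values) if values else 0.0


# Raw strings keep the backslashes in the categories literal.
example = category_complexity(r"(S[pss]\NP)/(S[to]\NP)")

# Lexical categories of "John ate salad with mushrooms" as given in the text.
lexical = ["NP", r"(S\NP)/NP", "NP", r"(NP\NP)/NP", "NP"]
mean = average_complexity(lexical)
```

On the example from the text, `(S[pss]\NP)/(S[to]\NP)`, the function counts S[pss], NP, S[to] and NP, giving 4. Categories made only of punctuation symbols would count as 0 under this pattern.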

Evaluation Data
As evaluation data, we use the WIKI and SIMPLEWIKI parallel sentence pairs collected by Hwang et al. (2015), a newer and larger collection than that of Zhu et al. (2010). We only use the pairs from the section GOOD, which consists of 150K pairs. We further removed pairs containing identical sentences, which resulted in 117K clean pairs. We randomly divided the data into training (60%), development (20%) and test (20%) splits.

Implementation details
As our classifier (see Section 2), we use an SVM trained with Sequential Minimal Optimization in the Weka toolkit (Hall et al., 2009), following its popularity in the readability literature (Feng, 2010; Hancke et al., 2012; Vajjala and Meurers, 2014). We use the CCG parser of Ambati et al. (2015) for extracting CCG derivations. This parser requires a CCG supertagger to limit its search space, for which we use the EasyCCG tagger (Lewis and Steedman, 2014).

Baseline
NON-INCREMENTAL PST. Following Vajjala and Meurers (2016), we use features extracted from Phrase Structure Trees (PST) produced by the Stanford parser (Klein and Manning, 2003), a non-incremental parser. We use the exact code used by Vajjala and Meurers (2016) to extract these features, which include part-of-speech tags, constituency features such as the number of noun phrases, verb phrases and prepositional phrases, and the average size of the constituent trees. Vajjala and Meurers (2016) used a total of 57 features.

Results
First we analyze the impact of incremental CCG features (hence the name INCREMENTAL CCG).
Vajjala and Meurers (2016) have also used psycholinguistic features such as age of acquisition of words, word imagery ratings, word familiarity ratings, and ambiguity of a word, collected from the psycholinguistic repositories Celex (Baayen et al., 1995), MRC (Wilson, 1988), AoA (Kuperman et al., 2012) and WordNet (Fellbaum, 1998). These features are found to be highly predictive for assessing readability. We enhance our syntactic models NON-INCREMENTAL PST and INCREMENTAL CCG by adding these psycholinguistic features to build NON-INCREMENTAL PST++ and INCREMENTAL CCG++ respectively.
Speed. In addition to accuracy, parsing speed is important in real-time applications. The Stanford parser took 204 minutes to parse the test data, at a speed of 3.8 sentences per second. The incremental CCG parser took 16 minutes, at an average speed of 47.5 sentences per second, a 12X improvement over the Stanford parser. These numbers include POS tagging time for the Stanford parser, and POS tagging and supertagging time for the incremental CCG parser. All the systems were run on the same hardware (Intel i5-2400 CPU @ 3.10GHz).
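The reported speed figures can be cross-checked with a little arithmetic. The sketch below uses only numbers stated in the text; the assumption that the test split contains two sentences per pair (20% of 117K pairs) is ours.

```python
# Figures reported in the text.
STANFORD_MINUTES, STANFORD_RATE = 204, 3.8  # minutes, sentences per second
CCG_MINUTES, CCG_RATE = 16, 47.5            # minutes, sentences per second


def sentences_parsed(minutes: float, rate: float) -> float:
    """Sentences processed in `minutes` at `rate` sentences per second."""
    return minutes * 60 * rate


speedup = CCG_RATE / STANFORD_RATE
stanford_total = sentences_parsed(STANFORD_MINUTES, STANFORD_RATE)
ccg_total = sentences_parsed(CCG_MINUTES, CCG_RATE)

# Assumption: the test split is 20% of 117K pairs, with two sentences per pair.
expected_test_sentences = 0.20 * 117_000 * 2

print(f"speedup: {speedup:.1f}x")
print(f"sentences implied by Stanford timing: {stanford_total:.0f}")
print(f"sentences implied by CCG timing: {ccg_total:.0f}")
print(f"expected test sentences: {expected_test_sentences:.0f}")
```

The ratio of the two rates is 12.5, in line with the reported 12X, and both timings imply roughly 46K sentences, close to the size expected for the test split given that the reported times and rates are rounded.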

Conclusion
Our empirical evaluation on assessing relative sentence complexity suggests that syntactic features extracted from an incremental CCG parser are more useful than those from a non-incremental phrase structure parser. This result aligns with psycholinguistic findings that the human sentence processor is incremental. Our incremental model enhanced with psycholinguistic features achieves the best reported results on predicting relative sentence readability. We experimented with Simple Wikipedia and Wikipedia data from Hwang et al. (2015). Future work could explore the usefulness of our system on other datasets such as the OneStopEnglish (OSE) corpus (Vajjala and Meurers, 2016) or the dataset of Xu et al. (2015). We are also currently exploring the usefulness of incremental analysis for psycholinguistic data by switching off the lookahead feature.