Lexical Features Are More Vulnerable, Syntactic Features Have More Predictive Power

Understanding the vulnerability of linguistic features extracted from noisy text is important for both developing better health text classification models and for interpreting vulnerabilities of natural language models. In this paper, we investigate how generic language characteristics, such as syntax or the lexicon, are impacted by artificial text alterations. The vulnerability of features is analysed from two perspectives: (1) the level of feature value change, and (2) the level of change of feature predictive power as a result of text modifications. We show that lexical features are more sensitive to text modifications than syntactic ones. However, we also demonstrate that these smaller changes of syntactic features have a stronger influence on classification performance downstream, compared to the impact of changes to lexical features. Results are validated across three datasets representing different text-classification tasks, with different levels of lexical and syntactic complexity of both conversational and written language.


Introduction
It is important to understand the vulnerability of linguistic features to text alteration because (1) pre-defined linguistic features are still frequently used in health text classification, e.g., detecting Alzheimer's disease (AD) (Masrani et al., 2017; Zhu et al., 2018; Balagopalan et al., 2018), aphasia (Fraser et al., 2015), or sentiment from language (Maas et al., 2011); and (2) understanding the importance of syntactic and lexical information separately as well as interactively is still an open research area in linguistics (Lester et al., 2017; Blaszczak, 2019).
Lexical richness and complexity relate to nuances and the intricacy of meaning in language. Numerous metrics to quantify lexical diversity, such as type-token ratio (TTR) (Richards, 1987) and MTLD (McCarthy, 2005), have been proposed. These metrics capture various dimensions of meaning, quantity and quality of words, such as variability, volume, and rarity. Several of these have been identified as important for a variety of tasks in applied linguistics (Daller et al., 2003). For example, metrics related to vocabulary size, such as TTR and word frequencies, have proven to help with early detection of mild cognitive impairment (MCI) (Aramaki et al., 2016), and hence are important for early dementia diagnosis. Discourse informativeness, measured via propositional idea density, is also shown to be significantly affected in speakers with aphasia (Bryant et al., 2013). Furthermore, lexicon-based methods have proved to be successful in sentiment analysis (Taboada et al., 2011; Tang et al., 2014).
Syntactic complexity is evident in language production in terms of syntactic variation and sophistication or, in other words, the range and degree of sophistication of the syntactic structures that are produced (Lu, 2011; Ortega, 2003). This construct has attracted attention in a variety of language-related research areas. For example, researchers have examined the developmental trends of child syntactic acquisition (e.g., Ramer, 1977), the role of syntactic complexity in treating syntactic deficits in agrammatical aphasia (e.g., Melnick and Conture, 2000; Thompson et al., 2003), the relationship between syntactic complexity in early life and symptoms of Alzheimer's disease in old age (e.g., Kemper et al., 2001; Snowdon et al., 1996), and the effectiveness of syntactic complexity as a predictor of adolescent writing quality (e.g., Beers and Nagy, 2009). Indefrey et al. (2001) reported data on brain activation during syntactic processing and demonstrated that syntactic processing in the human brain happens independently of the processing of lexical meaning. These results were supported by more recent studies showing that different brain regions support distinct mechanisms in the mapping from a linguistic form onto meaning, thereby separating syntactic agrammaticality from linguistic complexity (Ullman et al., 2005; Friederici et al., 2006). This motivates us to explore the importance of lexical and syntactic features separately.
To our knowledge, there is no previous research in the medical text classification area exploring the individual value of lexical and syntactic features with regard to their vulnerability and importance for ML models. Syntactic and lexical feature groups are often used together without specifying their individual value. For example, recent work in text classification for AD detection revealed that a combination of lexical and syntactic features works well (Fraser et al., 2016; Noorian et al., 2017); the same is true for the detection of other cognitive diseases or language impairments (Meteyard and Patterson, 2009; Fraser et al., 2014), as well as sentiment detection in healthy speech and language (Negi and Buitelaar, 2014; Marchand et al., 2013; Pang et al., 2002).
In this paper, we focus on the individual value of lexical and syntactic feature groups, as studied across medical text classification tasks, types of language, datasets and domains. As such, the main contributions of this paper are:
• Inspired by the results of neuroscience studies (Indefrey et al., 2001), we explore selective performance of lexical and syntactic feature groups separately.
• We demonstrate, using multiple analysis methods, that there is a clear difference in how lexical features endure text alterations in comparison to the syntactic ones as well as how the latter impact classification.
• We report results on three different datasets and four different classifiers, which allows us to draw more general conclusions.
• We conduct an example-based analysis that explains the results obtained during the analysis.

Related Work
Prior research reports the utility of different modalities of speech, lexical and syntactic (Bucks et al., 2000; Fraser et al., 2016; Noorian et al., 2017; Zhu et al., 2019), in detecting dementia. Bucks et al. (2000)
Similarly, varying feature sets have been used for detecting aphasia from speech. Researchers have studied the importance of syntactic complexity indicators such as Yngve-depth and length of various syntactic representations for detecting aphasia (Roark et al., 2011), as well as lexical characteristics such as average frequency and the imageability of words used (Bird et al., 2000). Patterns in production of nouns and verbs are also particularly important in aphasia detection (Wilson et al., 2010; Meteyard and Patterson, 2009). Fraser et al. (2014) used a combination of syntactic and lexical features with ASR transcription for the diagnosis of primary progressive aphasia with a cross-validated accuracy of 100% within a dataset of 30 English speakers. More recently, Le et al. (2017) proposed methods to detect paraphasia, a type of language output error commonly associated with aphasia, in aphasic speech using phone-level features.
Sentiment analysis methodologies often use lexicon-based features (Taboada et al., 2011; Tang et al., 2014). Syntactic characteristics of text, such as proportions of verbs and adjectives and the nature of specific clauses in sentences, are also salient in sentiment detection (Chesley et al., 2006; Meena and Prabhakar, 2007). Additionally, systems using both syntactic and lexical features have been proposed in prior work (Negi and Buitelaar, 2014; Marchand et al., 2013). For example, Marchand et al. (2013) trained ML models on patterns in syntactic parse-trees and occurrences of words from a sentiment lexicon to detect underlying sentiments from tweets, while Negi and Buitelaar (2014) employed syntactic and lexical features for sentence-level aspect-based sentiment analysis. Pang et al. (2002) showed that unigrams, bigrams and frequencies of parts-of-speech tags such as verbs and adjectives are important for an ML-based sentiment classifier.

Table 1: Comparison of the datasets in terms of task nature, type of language used to collect the data, lexical and syntactic complexity.

Datasets
In the following section, we provide context on each of the three similarly-sized datasets that we investigate, which differ in the following ways (see also Section 4):
1. Binary text classification task (AD detection, sentiment classification, aphasia detection).
2. Type of language.
3. Level of lexical and syntactic complexity.

DementiaBank (DemB)
DementiaBank is the largest publicly available dataset for detecting cognitive impairments, and is a part of the TalkBank corpus (MacWhinney, 2007). It consists of audio recordings of verbal descriptions and associated transcripts of the Cookie Theft picture description task from the Boston Diagnostic Aphasia Examination (Becker et al., 1994) from 210 participants aged between 45 and 90. Of these participants, 117 have a clinical diagnosis of AD (N = 180 speech recordings), while 93 (N = 229 speech recordings) are cognitively healthy. Many participants repeat the task within an interval of a year.

AphasiaBank (AphB)
AphasiaBank (MacWhinney, 2007) is another dataset of pathological speech that consists of aphasic and healthy control speakers performing a set of standard clinical speech-based tasks. The dataset includes audio samples of speech and associated transcripts. All participants perform multiple tasks, such as describing pictures, storytelling, free speech, and discourse with a fixed protocol. Aphasic speakers have various sub-types of aphasia (fluent, non-fluent, etc.). In total, there are 674 samples, from 192 healthy (N = 246 speech samples) and 301 aphasic (N = 428 speech samples) speakers.

IMDB Sentiment Extract (IMDBs)
The IMDB Sentiment (Maas et al., 2011) dataset is a standard corpus for sentiment detection that contains typewritten reviews of movies from the IMDB database along with the review-associated binary sentiment polarity labels (positive and negative). This dataset is used in order to extend the range of 'healthy' language and test generalizability of our findings. The core dataset consists of 50,000 reviews split evenly into train and test sets (with equal classes in both train and test). To maintain a comparable dataset size to DemB and AphB, we randomly choose 250 samples of each polarity from the train set, totalling 500 labeled samples.
All three datasets cover a breadth of transcripts in terms of presence or absence of impairment, as well as a spectrum of 'healthy' speech.

Feature Extraction
Following multiple previous works on text classification, we extract two groups of linguistic features: lexical and syntactic.
Lexical features: Features of the lexical domain have been recognized as an important construct in a number of research areas, including stylistics, text readability analysis, language assessment, first and second language acquisition, and cognitive disease detection. In order to measure various dimensions of lexical richness in the datasets under comparison, we compute statistics on token/unigram, bigram, and trigram counts. Additionally, we use the Lexical Complexity Analyser (Ai and Lu, 2010) to measure various dimensions of lexical richness, such as lexical density, sophistication, and variation.
Following Oraby et al. (2018), Dušek et al. (2019), and Jagfeld et al. (2018), we also use Shannon entropy (Manning and Schütze, 2000, p. 61ff.) as a measure of lexical diversity in the texts:

$$H(\mathit{text}) = -\sum_{x \in \mathit{text}} \frac{\mathit{freq}(x)}{\mathit{len}(\mathit{text})} \log_2 \frac{\mathit{freq}(x)}{\mathit{len}(\mathit{text})}$$

Here, x stands for all unique tokens/n-grams, freq stands for the number of occurrences in the text, and len for the total number of tokens/n-grams in the text. We compute entropy over tokens (unigrams), bigrams, and trigrams.
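The entropy measure described above can be illustrated with a minimal sketch. It assumes whitespace tokenization and a base-2 logarithm, neither of which is specified in the text; the helper names `ngrams` and `shannon_entropy` are hypothetical and are not taken from the tools cited above.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def shannon_entropy(items):
    """Shannon entropy (in bits, an assumed base) of the empirical
    distribution of the given tokens or n-grams."""
    total = len(items)
    if total == 0:
        return 0.0
    counts = Counter(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Illustrative text only; not taken from any of the datasets.
tokens = "the sink is running over the sink is full".split()
for n in (1, 2, 3):
    print(n, round(shannon_entropy(ngrams(tokens, n)), 3))
```

A text that repeats the same n-grams yields a lower value than a text of the same length in which every n-gram is distinct.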
We further complement Shannon text entropy with n-gram conditional entropy for next-word prediction (Manning and Schütze, 2000, p. 63ff.), given one previous word (bigram) or two previous words (trigram):

$$H_{cond}(\mathit{text}) = -\sum_{(c,w) \in \mathit{text}} \frac{\mathit{freq}(c,w)}{\mathit{len}(\mathit{text})} \log_2 \frac{\mathit{freq}(c,w)}{\mathit{freq}(c)}$$

Here, (c, w) stands for all unique n-grams in the text, composed of c (context, all tokens but the last one) and w (the last token). Conditional next-word entropy gives an additional, novel measure of diversity and repetitiveness: the more diverse the text is, the less predictable the next word is given the previous word(s); on the other hand, the more repetitive the text, the more predictable the next word is given the previous word(s).
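As a rough illustration of next-word conditional entropy, the sketch below counts n-grams and their contexts over the same sliding windows. The base-2 logarithm, the whitespace tokenization, and the choice to count contexts only where a continuation exists are assumptions of this example; `conditional_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter

def conditional_entropy(tokens, n):
    """Next-word conditional entropy (bits) given the previous n-1 tokens."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    gram_counts = Counter(grams)
    # Contexts are counted over the same windows, so freq(c) equals the
    # sum of freq(c, w) over all continuations w (an assumption here).
    context_counts = Counter(g[:-1] for g in grams)
    total = len(grams)
    return -sum(
        (count / total) * math.log2(count / context_counts[gram[:-1]])
        for gram, count in gram_counts.items()
    )

# A fully predictable sequence versus one with an ambiguous context.
print(conditional_entropy("a b a b a b".split(), 2))
print(conditional_entropy("a b a c".split(), 2))
```

In the first sequence every context has a single continuation, so the next word is fully predictable; in the second, the context "a" is followed by two different words, which raises the entropy.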
Syntactic features: We use the D-Level Analyser (Lu, 2009) to evaluate syntactic variation and complexity of human references using the revised D-Level Scale (Lu, 2014).
We use the L2 Syntactic Complexity Analyzer (Lu, 2010) to extract 14 features of syntactic complexity that represent the length of production units, sentence complexity, the amount of subordination and coordination, and the frequency of particular syntactic structures. The full list of lexical and syntactic features is provided in Appendix A.

Classification Models
We benchmark four different machine learning models on each dataset with 10-fold cross-validation. In cases of multiple samples per participant, we stratify by subject so that samples of the same participant do not occur in both the train and test sets in each fold. This is repeated for each
We consider Gaussian naïve Bayes (with equal priors), random forest (with 100 estimators and maximum depth 5), support vector machine (with RBF kernel, penalty C = 1), and a neural network with two hidden layers (with 10 units in each layer, ReLU activation, 200 epochs and Adam optimizer) (Pedregosa et al., 2011). Since the datasets have imbalanced classes, we identify F1 score with macro averaging as the primary performance metric.
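The participant-level split and the macro-averaged F1 metric described above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline used in the experiments: the greedy fold assignment is a stand-in for whatever splitting routine was used, it does not balance class proportions across folds, and the names `grouped_folds` and `macro_f1` are hypothetical.

```python
from collections import defaultdict

def grouped_folds(groups, n_folds=10):
    """Assign sample indices to folds so that all samples of one participant
    share a fold (greedy: largest participant goes to the emptiest fold)."""
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    folds = [[] for _ in range(n_folds)]
    for group in sorted(by_group, key=lambda g: -len(by_group[g])):
        min(folds, key=len).extend(by_group[group])
    return folds

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical participant IDs: p1 and p3 contributed several samples.
groups = ["p1", "p1", "p2", "p3", "p3", "p3"]
for test_fold in grouped_folds(groups, n_folds=3):
    print(sorted(test_fold))
print(macro_f1([1, 1, 0, 0], [1, 0, 0, 0]))
```

Because macro averaging weights each class equally, a classifier that ignores the minority class is penalized more than it would be under accuracy.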

Altering Text Samples
There can be three types of language perturbations at the word level: insertions, deletions, and substitutions of words. Balagopalan et al. (2019) showed that deletions affect performance significantly more than insertions and substitutions, so we likewise focus on deletions. Following Balagopalan et al. (2019), we artificially add deletion errors to original individual text samples at predefined levels of 20%, 40%, 60%, and 80%. To add the errors, we simply delete random words from original texts and transcripts at a specified rate.
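The deletion procedure can be sketched as follows. The text does not say whether a fixed fraction of words is removed or whether each word is dropped independently with the given probability; this example assumes the former, splits on whitespace, and uses a hypothetical helper name `delete_words`.

```python
import random

def delete_words(text, rate, seed=None):
    """Delete a fixed fraction of whitespace-separated words at random,
    preserving the order of the remaining words."""
    words = text.split()
    n_delete = round(len(words) * rate)
    rng = random.Random(seed)
    drop = set(rng.sample(range(len(words)), n_delete))
    return " ".join(w for i, w in enumerate(words) if i not in drop)

original = "she's holding the dish cloth in her right hand"
for rate in (0.2, 0.4, 0.6, 0.8):
    print(rate, "->", delete_words(original, rate, seed=0))
```

Passing a seed makes the altered samples reproducible, which is convenient when the same altered text must be fed to several classifiers.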

Evaluating Change of Feature Values
In order to evaluate the change of feature values for different levels of text alterations, z-scores are used. We calculate z-scores of each individual feature in the transcripts with each level of alteration, with relation to the value of that feature in the original unaltered transcript:

$$Z^{x}_{feat} = \frac{feat_{x} - \mu}{\sigma}$$

where $feat_{x}$ refers to a given syntactic or lexical feature extracted from a transcript with an alteration level of x = 20, 40, 60, 80, and µ and σ are computed over the entire original unaltered dataset.
Then, we average the individual z-scores across all the features within each feature group (syntactic and lexical) to get a z-score per feature group:

$$Z^{x}_{syntactic} = \frac{1}{N_{syn}} \sum_{i=1}^{N_{syn}} Z^{x}_{feat_{i}}, \qquad Z^{x}_{lexical} = \frac{1}{N_{lex}} \sum_{j=1}^{N_{lex}} Z^{x}_{feat_{j}}$$

where $N_{syn}$ and $N_{lex}$ refer to the total number of syntactic and lexical features, respectively.
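A minimal sketch of the per-feature and per-group z-scores follows. It assumes the population standard deviation and a plain (signed) mean over features, neither of which is stated in the text; the feature names and values are invented for illustration, and `feature_z_scores` and `group_z_score` are hypothetical helper names.

```python
from statistics import mean, pstdev

def feature_z_scores(altered, original_dataset):
    """z-score each altered feature value against the unaltered dataset.

    altered: {feature: value} for one altered transcript.
    original_dataset: {feature: [values over all unaltered transcripts]}.
    """
    z = {}
    for name, value in altered.items():
        mu = mean(original_dataset[name])
        sigma = pstdev(original_dataset[name])  # assumed: population std
        z[name] = (value - mu) / sigma if sigma else 0.0
    return z

def group_z_score(z, feature_names):
    """Average the per-feature z-scores within one feature group."""
    return mean(z[name] for name in feature_names)

# Invented values: one lexical and one syntactic feature.
original = {"ttr": [0.4, 0.6], "clauses_per_sentence": [1.0, 3.0]}
altered = {"ttr": 0.7, "clauses_per_sentence": 2.0}
z = feature_z_scores(altered, original)
print(group_z_score(z, ["ttr"]), group_z_score(z, ["clauses_per_sentence"]))
```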

Evaluating change of feature predictive power
We extract $\Delta F1_{x}$, or change in classification F1 macro score, with x% alteration with respect to no alteration, for x = 20, 40, 60, 80, i.e.,

$$\Delta F1_{x} = F1_{x} - F1_{0}$$

where $F1_{x}$ is the score at x% alteration and $F1_{0}$ the score on unaltered text. To identify the relative importance of syntactic or lexical features on classification performance, we estimate coefficients of effect for syntactic and lexical features. These coefficients are obtained by regressing to F1 deltas using the syntactic and lexical feature z-scores described in Section 3.5 for each alteration level. Thus, the regression equation can be expressed as:

$$\Delta F1_{x} = \alpha Z^{x}_{syntactic} + \beta Z^{x}_{lexical}$$

The training set for estimating α and β consists of $(\Delta F1_{x}; (Z^{x}_{syntactic}, Z^{x}_{lexical}))$ for x = 20, 40, 60, 80.
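The coefficient estimation can be sketched as an ordinary least-squares fit with two predictors. The absence of an intercept, the use of the coefficient magnitudes as importances, and all numbers below are assumptions made for illustration; `fit_two_coefficients` is a hypothetical helper name.

```python
def fit_two_coefficients(z_syn, z_lex, delta_f1):
    """Least-squares alpha, beta for delta_f1 ~ alpha*z_syn + beta*z_lex
    (no intercept, an assumption), via the 2x2 normal equations."""
    s_ss = sum(a * a for a in z_syn)
    s_ll = sum(b * b for b in z_lex)
    s_sl = sum(a * b for a, b in zip(z_syn, z_lex))
    s_sy = sum(a * y for a, y in zip(z_syn, delta_f1))
    s_ly = sum(b * y for b, y in zip(z_lex, delta_f1))
    det = s_ss * s_ll - s_sl * s_sl
    if abs(det) < 1e-12:
        raise ValueError("z-score columns are collinear")
    alpha = (s_sy * s_ll - s_ly * s_sl) / det
    beta = (s_ly * s_ss - s_sy * s_sl) / det
    return alpha, beta

# Invented group z-scores and F1 deltas at 20%, 40%, 60%, 80% alteration.
z_syn = [1.0, 2.0, 3.0, 4.0]
z_lex = [2.0, 1.0, 4.0, 3.0]
delta_f1 = [-0.3 * s - 0.1 * l for s, l in zip(z_syn, z_lex)]
alpha, beta = fit_two_coefficients(z_syn, z_lex, delta_f1)
print(alpha, beta, abs(alpha) / abs(beta))
```

With only four alteration levels per fit, the two z-score series need to differ enough across levels for the coefficients to be identifiable, which is why the sketch rejects collinear inputs.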

Comparing datasets
The three datasets used in our exploration represent different dimensions of lexical and syntactic complexity, and are unique in the nature of the tasks they involve and their type of language, as shown in Table 1. AphB is the only dataset that includes speech samples of unstructured speech, while IMDBs is unique as it contains samples of written language, rather than transcripts of verbal speech.
In terms of lexical and syntactic complexity, it is interesting to note that AphB contains the samples that are most lexically complex, while at the same time it is the simplest from the syntactic point of view. We associate this with the fact that AphB data come from partially unstructured tasks, where free speech increases the use of a more complex and more diverse vocabulary. IMDBs is the most lexically rich dataset (see Table 2), with the highest ratio of uni-, bi-, and trigrams occurring only once.
IMDBs is the most complex according to various measures of syntactic complexity: it has the highest scores on metrics associated with length of production unit, amount of subordination, coordination, and particular structures, and it also has the highest amount of complex sentences (sentences of D-level 5-7, as shown in Table 2). This may be explained by the fact that it is the only dataset based on typewritten language. AphB has the lowest level of syntactic complexity, containing the highest amount of the simplest sentences (D-level 0), and the lowest scores in other subgroups of syntactic features (see Table 2).
Next, we analyse whether these variously distinct datasets have any common trend with regard to the vulnerability and robustness of lexical and syntactic feature groups.

Feature vulnerability
Following the method described in Section 3.5, we analyse if any of the feature groups (lexical or syntactic) is influenced more by text alterations. As shown in Figure 1, the values of lexical features are, on average, influenced significantly more than syntactic ones (Kruskal-Wallis test, p < 0.05). Such a difference is observed in all three datasets individually (see Table 3).
The differences of z-scores between lexical and syntactic feature groups are higher for the IMDBs dataset, which suggests that the difference is most visible either in healthy or in written language. These results suggest that lexical features are more vulnerable to simple text alterations, such as introduced deletion errors, while syntax-related features are more robust to these modifications. However, stronger changes of raw feature values do not necessarily mean that the resulting modified features become more or less important for classifiers. This leads us to inspect the impact of text alteration on feature predictive power.

Feature significance and the impact of alterations on feature predictive power
A simple method to understand the potential predictability of a feature is by looking at how different the feature value is between classes and whether this difference is statistically significant. This method was previously used in studies assessing automatic speech recognition for Alzheimer's (Zhou et al., 2016) and aphasia detection (Fraser et al., 2013).
We rank the p-values obtained, in each condition, from a two-tailed non-parametric Kruskal-Wallis test performed on each feature between the two classes (healthy vs unhealthy in the DemB and AphB datasets, and positive vs negative in IMDBs) and assign a rank to each feature. It is interesting to note that lexical features occupy the overwhelming majority of first places across all datasets, showing that lexical features are significantly different between classes. We further analyse, following Brunato et al. (2018), how the rank of each feature changes when different levels of text alterations are introduced. The maximum rank increase is higher on average for lexical features than for syntactic ones (see Figure 2 for details of rank changes in the DemB dataset) across all datasets. The ratio of features that become insignificant after text alteration is also higher for lexical features than for syntactic ones on average across all datasets. As Figure 2 shows, the features with increased rank are those that were not initially significantly different between classes. The combination of these results suggests that less important lexical features become more and more important with the addition of text alterations, which may decrease the performance of classification.
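The ranking step can be illustrated with a self-contained two-group Kruskal-Wallis test. This sketch uses the standard H statistic with tie correction and the chi-square approximation with one degree of freedom; it is not the implementation used in the study, the feature values are invented, and `kruskal_wallis_two_groups` and `rank_features` are hypothetical names.

```python
import math

def kruskal_wallis_two_groups(a, b):
    """Kruskal-Wallis H test for two samples, with tie correction.
    Returns (H, p) using the chi-square approximation with 1 df."""
    pooled = sorted((v, g) for g, vals in enumerate((a, b)) for v in vals)
    n = len(pooled)
    rank_sums = [0.0, 0.0]
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    h = 12 / (n * (n + 1)) * (
        rank_sums[0] ** 2 / len(a) + rank_sums[1] ** 2 / len(b)
    ) - 3 * (n + 1)
    correction = 1 - tie_term / (n ** 3 - n)
    if correction == 0:
        return 0.0, 1.0  # all values identical
    h /= correction
    # Survival function of chi-square with 1 df: erfc(sqrt(h / 2)).
    return h, math.erfc(math.sqrt(max(h, 0.0) / 2))

def rank_features(features_by_class):
    """features_by_class: {feature: (values_class0, values_class1)}.
    Returns feature names ordered by ascending p-value."""
    p = {f: kruskal_wallis_two_groups(a, b)[1]
         for f, (a, b) in features_by_class.items()}
    return sorted(p, key=p.get)

# Invented feature values for two classes.
data = {"feat_a": ([1, 2, 3], [4, 5, 6]), "feat_b": ([1, 4, 5], [2, 3, 6])}
print(rank_features(data))
```

Repeating the ranking on features extracted from altered text and comparing positions gives the kind of rank change discussed above.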
The above method of calculating p-values is analogous to feature selection performed as a pre-processing step before classification. Although this step may provide some initial insights into feature significance, it does not guarantee the most significant features will be those having the most predictive power in classification.
We use the method described in Section 3.6 to evaluate the impact of text alteration on the features' predictive power. The results in Table 4 show that syntactic features have more predictive power than lexical features. The lowest ratio is observed with DemB, and the AphB results are very close, suggesting that syntactic features are approximately twice as important as lexical features in predicting pathological speech. In healthy written language, the difference is even higher and reaches 7.15 for the random forest classifier.
In summary, the predictive power of syntactic features is much stronger than that of lexical features across three datasets and four main classifiers, which suggests the results may generalize across several different tasks and domains.

Example-based Analysis
As shown in previous sections, values of lexical features are on average more influenced by text alterations, but this change does not affect classification as much as smaller value changes in syntactic features. Table 5 provides examples of two features, one lexical and one syntactic, their value changes when text samples are modified, and the associated change of the classifier's predictions.
The value of the lexical feature cond entropy 3gram, showing conditional entropy calculated for trigrams, decreases by more than 50% when the text sample is modified by only 20%. This change is much higher than the associated absolute change of the syntactic feature C/S (the number of clauses per sentence), which increases by only 11% at the same level of alteration. The prediction made by a classifier in the case of the lexical feature, however, is the same as the prediction for the original transcript. Only when the general level of alteration reaches 60% and the value of the lexical feature decreases by more than 85% does the prediction become incorrect. In the case of syntactic features, the prediction already changes to incorrect at an alteration level of 20%, although the feature value is still quite close to the original one. Consider this sentence in the original transcript: "She's holding the dish cloth in her right hand and the plate she is drying in her left." With 20% of errors it is converted to the following: "She's holding the cloth in right hand the plate she drying in her left." It is clear that lexical features based on the frequency of uni-, bi- and trigrams are affected by this change, because quite a few words disappear in the second variant. In terms of syntactic structures, however, the sentence is not damaged much, as we can still see the same number of clauses, coordinate units, or verb phrases. Such an example helps explain the results in the previous sections.

Conclusions and Future Research
This paper shows that linguistic features of text, associated with syntactic and lexical complexity, are not equal in their vulnerability levels, nor in their predictive power.We study selective performance of these two feature aggregations on three distinct datasets to verify the generalizability of observations.
We demonstrate that values of lexical features are easily affected by even slight changes in text, by analysing z-scores at multiple alteration levels. Syntactic features, however, are more robust to such modifications. On the other hand, smaller changes of syntactic features result in stronger effects on classification performance. Note that these patterns are consistently observed across different datasets with different levels of lexical and syntactic complexity, and for typewritten text and transcribed speech.
Several methods to detect and correct syntactic (Ma and McKeown, 2012) and lexical errors (Klebesits and Grechenig, 1994) as a post-processing step for output from machine translation or ASR systems have been proposed in prior work. Since our analysis indicates that error-affected syntactic features have a stronger effect on classification performance, we suggest imposing higher penalties on detecting and correcting syntactic errors than lexical errors in medical texts. A limitation of our study is that we focused on text alterations of a specific type, and the results were only tested on relatively small datasets. In future work, we will extend the analysis to other simple text alterations such as substitutions as well as adversarial text attacks (Alzantot et al., 2018). In addition, we will extend the current work to see how state-of-the-art neural network models, such as BERT, can handle text alterations as they capture lexical, syntactic and semantic features of the input text in different layers. Finally, note that the datasets considered in this study are fairly small (between 500 and 856 samples per domain). Efforts to release larger and more diverse data sets through multiple channels (such as challenges) in such domains as Alzheimer's or aphasia detection, and depression detection (Valstar et al., 2016; MacWhinney, 2007; Mozilla, 2019) need to be reinforced.
A List of Linguistic Features

Figure 1: Left: Change of syntactic and lexical feature values at different alteration levels, averaged across three datasets. Right: Impact of syntactic and lexical features on classification for DementiaBank, AphasiaBank and IMDB Sentiment datasets, averaged across four classifiers.

Figure 2: Change of lexical (left) and syntactic (right) feature rank when text alterations of different levels are introduced. Negative numbers denote a decrease in rank, and positive numbers an increase in rank. Blue cell colours denote the highest increase in rank, red the highest decrease, and yellow a smaller level of increase or decrease. Features are ranked based on p-values with the lowest p-value at the top. White cells show that features were not significantly different between classes in the original text samples, based on the DemB dataset.

Table 2: Lexical complexity and richness, and syntactic complexity of the three datasets. Counts for n-grams appearing only once are shown as proportions of the total number of respective n-grams. Highest values on each line are typeset in bold.
The minority class is oversampled in the training set using SMOTE (Chawla et al., 2002) to deal with class imbalance.

Table 3: Change of feature values, per dataset and per level of text alterations.

Table 4: Ratio of coefficients, calculated as Importance_syntactic / Importance_lexical. A ratio higher than one indicates that syntactic features are more important for a classifier than lexical ones.

Table 5: Examples of two features, cond entropy 3gram and C/S, their value change when text samples are modified on the level of 20%, 40% and 60%, and associated classifier's predictions. Examples are provided using the DemB transcript samples and feature values.

Original: is reaching into the cookie jar. he's falling off the stool. the little girl is reaching for a cookie. mother is drying the dishes. the sink is running over. mother's getting her feet wet. they all have shoes on. there's a cup two cups and a saucer on the sink. the window has draw withdrawn drapes. you look out on the driveway. there's kitchen cabinets. oh what's happening. mother is looking out the window. the girl is touching her lips. the boy is standing on his right foot. his left foot is sort of up in the air. mother's right foot is flat on the floor and her left she's on her left toe. &uh she's holding the dish cloth in her right hand and the plate she is drying in her left. I think I've run out of. yeah.
20%: reaching the cookie jar. he's falling off the stool. the little girl is reaching for cookie. mother is the dishes. the sink is over. mother's getting her feet. all have shoes. there's cup two cups a saucer on sink. window has draw withdrawn drapes. you look out on driveway. there's kitchen cabinets. oh what's happening. mother out the window. the girl is lips. the boy standing on. his left foot is sort of up in the air. mother's right foot is flat on the floor and left she's on her left toe. &uh she's holding the cloth in right hand the plate she drying in her left. think I've run out of.
40%: jar. he's falling the stool. the little is reaching a cookie. mother drying the dishes. the sink is running over. mother's her wet. all have shoes on. a two and a sink. the. you look driveway. there's kitchen. oh what's happening. mother out the window. the is her. is his foot. his left foot is sort of up air. foot is flat floor and she's her toe. &uh she's holding the dish cloth in right the she is drying in left. I think of.
60%: . falling stool. for cookie. the dishes. the. mother's feet wet. they have. a two cups a sink. the has withdrawn drapes. the. there's. oh. mother the window. the lips. the boy right. is sort of. right foot is flat on floor on her left. &uh cloth right hand and the she is in her left. yeah.

Original: the first place the the mother forgot to turn off the water and the water's running out the sink. and she's standing there. it's falling on the floor. the child is got a stool and reaching up into the cookie jar. and the stool is tipping over. and he's sorta put down the plates. and she's reaching up to get it but I don't see anything wrong with her though. yeah that's it. I can't see anything.
20%: the the mother forgot to turn off the water the water's out the sink. and standing there. it's falling floor. is got a stool and into the cookie jar. and the stool is tipping. and he's sorta down the plates. and she's reaching to get it but I don't see anything wrong with her though. that's it. I can't see anything.
40%: in the forgot the water the water's out the sink. and she's standing there. it's on the. the is got a stool and reaching up the. the is tipping. and he's sorta the. and she's reaching up to get but I her. yeah that's. I can't.
Here, a T-unit is defined as the shortest grammatically allowable sentence into which writing can be split, or minimally terminable unit. Often, but not always, a T-unit is a sentence.