Is writing style predictive of scientific fraud?

The problem of detecting scientific fraud using machine learning was recently introduced, with initial positive results from a model taking into account various general indicators. These results seem to suggest that writing style is predictive of scientific fraud. We revisit these initial experiments and show that the leave-one-out testing procedure they used likely leads to a slight over-estimate of the predictability, but also that simple models can outperform the proposed model by some margin. We go on to explore more abstract linguistic features, such as linguistic complexity and discourse structure, only to obtain negative results. Upon analyzing our models, we do see some interesting patterns, though: scientific fraud, for example, contains less comparison, as well as different types of hedging and ways of presenting logical reasoning.


Introduction
Cases of scientific misconduct are identified every year. Scientific papers are retracted because of errors, or for suspected fraud, ranging from plagiarism and minor manipulations to faking the data and disguising the results. It has been shown, however, that among the retracted articles indexed in PubMed, only 21.3% were retracted due to error, while 67.4% were removed due to misconduct, among which suspected fraud amounts to 43.4%, the others being due to duplicate publication or plagiarism (Fang et al., 2012).
In a recent paper, Markowitz and Hancock (2015) proposed the first analysis of writing style in fraudulent papers across authors and disciplines. They approached the question of whether these authors have a specific writing style from a psychological perspective. They found that these papers exhibit a higher rate of jargon, make more use of references, and have a lower readability rate, suggesting that the authors try to obfuscate their writing, making the papers harder to read and analyze. They report classification results using a leave-one-out strategy over the dataset, with a classification accuracy of 57.2%. As suggested in the paper, we propose to improve this performance by evaluating different classification models.
In this paper, we first show that much better results can be obtained using a simple bag-of-words representation and Logistic Regression. Our best model is a syntax-enhanced trigram model. We also show that the leave-one-out strategy used by the authors leads to an over-estimation of model accuracy, and we report new results based on a more robust strategy that takes into account the low number of instances available, namely nested cross-validation (Varma and Simon, 2006; Scheffer, 1999). We also considered semantic and discourse features, but we did not observe improvements with such features.
Of course, the fact that a bag-of-words model outperforms a model based on psychologically motivated features may simply be the result of overfitting. We present an extensive feature analysis to validate our models, as well as to test psychologically motivated hypotheses from the literature.
Contributions (i) We present a simple model with high accuracy, and show that it implicitly captures the previously proposed psychologically motivated features. (ii) We show that adding semantic and discourse features does not lead to improvements. (iii) On the other hand, our feature analysis suggests that the models do learn to focus on concepts that are intuitively related to scientific misconduct, e.g., that scientific fraud contains less comparison.

Related work
Markowitz and Hancock (2015) were the first to study writing style in fraudulent papers. They gathered a corpus of 253 articles indexed in PubMed that have been retracted for fraudulent data, as well as 253 unretracted papers (see Section 3). They define five indicators of obfuscation, and show that fraudulent papers tend to demonstrate a higher rate of linguistic obfuscation, corresponding to a lower readability, a higher use of jargon and a higher degree of abstraction. Linked to studies on deception identification, they also report a lower rate of positive emotion terms and a higher rate of causal terms (e.g. "depend", "induce", "manipulated") in fraudulent papers. The readability score was computed using Coh-Metrix (McNamara et al., 2013), while the other scores were based on the Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2007), a dictionary associating a word to various scores such as abstraction (a word is considered as jargon if it is not found in the dictionary). Finally, they report 57.2% accuracy using these five indicators as features, a score that we show is probably a little too optimistic, since it is based on a leave-one-out procedure (see Section 5). We extend their work by first showing that a simple unigram model outperforms their model by a large margin, but also by considering more indicators, including discourse and syntax, and by showing, as mentioned, that their scores were probably over-estimated due to their validation strategy.
Our work is also inspired by another related field of research concerned with deception detection. Mihalcea and Strapparava (2009) built three datasets consisting of 100 true and 100 deceptive short statements on three different topics (abortion, death penalty, best friend). Using only unigrams, they report 70.8% accuracy in a 10-fold cross-validation. They found that specific word classes, as defined in the LIWC, were predictive of deceptive texts, especially classes indicating detachment from self or related to certainty. Feng et al. (2012a) investigate syntactic features, using lexicalized and unlexicalized production rules in addition to shallow features (word unigrams and bigrams, and POS unigrams). They experiment on truthful and deceptive reviews from TripAdvisor, either gold (Ott et al., 2011) or retrieved using a fake review detector (Feng et al., 2012b), reviews automatically extracted from Yelp, and the corpus introduced in (Mihalcea and Strapparava, 2009). They report scores between 64.3% and 91.2% accuracy, depending on the dataset. They found that, for all datasets, syntax helps, and that deceptive reviews more frequently use VP, SBAR and WHADVP.
We also consider n-gram features, syntactic features, as well as discourse features. Our task is, however, somewhat different, since authors of fraudulent papers are not directly lying but rather trying to conceal their fraud. Moreover, our documents are longer and are of a different genre, i.e. scientific articles.

Data
We use the dataset proposed in (Markowitz and Hancock, 2015) containing 253 publications retracted for data fraud and 253 unretracted publications. These publications were taken from the PubMed archives from 1973 through 2013.
For each retracted paper, a control paper was selected that was published in the same year and in the same journal, with some common keywords when possible. When no such paper exists (around 19% of the papers), a paper from an adjacent year, or using the same words in the abstract, was selected.
The data used is the pre-processed version presented in (Markowitz and Hancock, 2015): Words were converted from British English to American English forms. Brackets, parentheses, and percent signs were removed. Periods were removed from certain words, such as 'Dr.' or 'Inc.'. The documents only contain the main body text (no section titles, figures, or tables).

Methodology
We investigate different types of features, from n-grams to discourse. In large-vocabulary feature spaces, we perform feature reduction to reduce sparsity. We then provide an analysis of the features to identify the most informative indicators.

Word features
We use word n-grams as features, with n ∈ {1, 2, 3}. In order to test the hypotheses presented in previous studies, we also use lexicons to extract information about the tokens. We use the General Inquirer (Stone and Kirsh, 1966) to extract words expressing a polarity (the features built represent the polarity as positive, negative, both or neutral) and words corresponding to a causal term. We also use this lexicon to map the words to a more general semantic category (Inquirer).
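As an illustration of the word features described above, the following minimal sketch counts word n-grams for n ∈ {1, 2, 3} and maps tokens to lexicon categories. The function names and the three-entry `POLARITY` dictionary are hypothetical stand-ins introduced here; they are not the General Inquirer lexicon nor the code used in the experiments.

```python
from collections import Counter

def word_ngrams(tokens, n_values=(1, 2, 3)):
    """Count word n-grams for every n in n_values."""
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Hypothetical toy lexicon (assumption): word -> polarity category.
POLARITY = {"improved": "positive", "robust": "positive", "failure": "negative"}

def lexicon_features(tokens, lexicon):
    """Count how often each lexicon category occurs among the tokens."""
    return Counter(lexicon[t] for t in tokens if t in lexicon)

tokens = "the improved model is robust to failure".split()
ngrams = word_ngrams(tokens)
polarity = lexicon_features(tokens, POLARITY)
```

Each document would then be represented by the concatenation of such count dictionaries, one per feature group.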
We identify all the personal pronouns using manually defined lists. Finally, we also include as features hedge and modal words, also using a pre-defined list.
Syntactic features In order to obtain syntactic information, we parse the data using UDPipe (Straka et al., 2016) and a prebuilt model available online for English. We follow (Johannsen et al., 2015) in extracting all subtrees of up to three tokens (treelets).
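To make the notion of treelets concrete, here is a simplified sketch that enumerates subtrees of up to three tokens from one dependency-parsed sentence. It assumes the parse is given as parallel lists of POS tags, 1-based head indices (0 for the root) and relation labels; the string encoding of each treelet is an assumption made for this example and may differ from the representation of Johannsen et al. (2015).

```python
from collections import Counter
from itertools import combinations

def treelets(pos, heads, deprels):
    """Count subtrees of one, two and three tokens in a dependency tree."""
    n = len(pos)
    children = {i: [] for i in range(n)}
    for i, h in enumerate(heads):
        if h > 0:                       # 0 marks the root, which has no head
            children[h - 1].append(i)
    feats = Counter()
    for i in range(n):
        feats[pos[i]] += 1              # single-token treelet
        for c in children[i]:
            edge = f"{pos[i]} -{deprels[c]}-> {pos[c]}"
            feats[edge] += 1            # head with one dependent
            for g in children[c]:       # chain: head, dependent, grand-dependent
                feats[f"{edge} -{deprels[g]}-> {pos[g]}"] += 1
        for a, b in combinations(children[i], 2):   # head with two dependents
            feats[f"{pos[a]} <-{deprels[a]}- {pos[i]} -{deprels[b]}-> {pos[b]}"] += 1
    return feats

# "We report new results", with a hand-written parse for illustration.
feats = treelets(
    pos=["PRON", "VERB", "ADJ", "NOUN"],
    heads=[2, 0, 4, 2],
    deprels=["nsubj", "root", "amod", "obj"],
)
```

For this sentence the sketch yields nine treelets: four single tokens, three head-dependent pairs and two three-token subtrees.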
Discourse features Finally, we automatically annotate all the data with discourse connectives and explicit discourse relations using simple models trained on the Penn Discourse Treebank (PDTB) (Prasad et al., 2008), a corpus of news articles from the Wall Street Journal. Discourse coherence is an indicator of the quality of a text (Lin et al., 2011) and of its reasoning, which could reveal an attempt to deceive. Some specific semantic relations could also be good indicators (e.g. Cause).
We used models to identify the discourse connectives (Connectives) and to identify the explicit discourse relation (Explicit relations) they trigger, either among the 4 coarse-grained classes (lvl1) at the top of the sense hierarchy or using the 11 more fine-grained relations at the second level (lvl2). Our models use Logistic Regression, with the connective, the surrounding words and their POS as features (Lin et al., 2009). They are trained on sections 2-21 of the PDTB. Our results on section 23 are close to the state of the art (Pitler and Nenkova, 2009; Pitler et al., 2008; Lin et al., 2014): 92.9% accuracy for identifying the connectives, 95.1% for the level-1 relations, and 86.2% for the level-2 relations.
Feature analysis In addition to presenting accuracies obtained with these feature sets, we also perform a feature analysis. For this purpose we use a combination of correlation coefficients, logistic regression coefficients, and stability selection (Meinshausen and Bühlmann, 2010), a method that consists in repeatedly fitting the model across different random subsamples, and counting how many times features are selected in ℓ1-regularized logistic regression models. For stability selection, we use the implementation available in scikit-learn (Pedregosa et al., 2011) with its default parameters, run it on the whole dataset and keep features selected more than 50% of the time.
We indicate the size of the original vocabulary and the number of selected features for each category in Table 1.

Classification
Representation We test count vectorizations separately with each set of features (unigrams, 2-3-grams, polarity, causality, Inquirer categories, pronouns (grouped per person, or considering each lemma), treelets, connectives, hedge words, level-1 relations and level-2 relations) and with combinations of these features.
Model We use a binary logistic regression classifier, optimizing the norm (ℓ1 or ℓ2) and the strength (c ∈ {0.001, 0.005, 0.01, 0.1, 0.5, 1, 5, 10, 100}) of the regularization term on held-out data.
Validation schemes Markowitz and Hancock (2015) report results with a leave-one-out strategy (LOO). However, LOO often under-estimates the error rate. We compare with a nested cross-validation procedure that can provide an almost unbiased estimate of the true error (Varma and Simon, 2006; Scheffer, 1999). Specifically, we use two cross-validation loops: the inner loop is used for tuning the hyperparameters, and the outer loop estimates the generalization error. The data are first split into N folds. For each k (1 ≤ k ≤ N), fold k serves as the evaluation set, while the N − 1 other folds form the training data, which are in turn split into M folds used for model fitting and hyperparameter selection. The best model is then evaluated on fold k. Final scores are averages over the N folds.
For comparison with Markowitz and Hancock (2015), we report performance with LOO and with nested cross-validation using LOO as outer loop, the inner loop being a random 5-fold cross-validation. We repeat each evaluation 10 times, and report a mean over these trials.
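The nested procedure described above can be sketched as follows. This is a minimal, library-free illustration under stated assumptions: `nested_cv`, `k_folds`, and the toy threshold classifier (whose `param` stands in for the regularization strength of the logistic regression) are hypothetical names and models introduced for this example, not the experimental code.

```python
import random

def k_folds(indices, k, rng):
    """Shuffle the indices and split them into k roughly equal folds."""
    idx = list(indices)
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv(X, y, fit, predict, grid, outer_k, inner_k, seed=0):
    """Outer loop estimates accuracy; inner loop selects the hyperparameter."""
    rng = random.Random(seed)

    def accuracy(model, idx):
        return sum(predict(model, X[i]) == y[i] for i in idx) / len(idx)

    def train_on(idx, param):
        return fit([X[i] for i in idx], [y[i] for i in idx], param)

    outer_scores = []
    for test in k_folds(range(len(y)), outer_k, rng):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        inner = k_folds(train, inner_k, rng)

        def inner_score(param):
            # Mean validation accuracy, using the outer training data only.
            scores = []
            for val in inner:
                val_set = set(val)
                fit_idx = [i for i in train if i not in val_set]
                scores.append(accuracy(train_on(fit_idx, param), val))
            return sum(scores) / len(scores)

        best = max(grid, key=inner_score)
        outer_scores.append(accuracy(train_on(train, best), test))
    return sum(outer_scores) / len(outer_scores)

# Toy one-dimensional classifier (assumption): threshold between class means.
def fit_threshold(xs, ys, param):
    m0 = sum(x for x, t in zip(xs, ys) if t == 0) / ys.count(0)
    m1 = sum(x for x, t in zip(xs, ys) if t == 1) / ys.count(1)
    return m0 + param * (m1 - m0)

def predict_threshold(threshold, x):
    return int(x > threshold)

X = [i / 10 for i in range(10)] + [10 + i / 10 for i in range(10)]
y = [0] * 10 + [1] * 10
# outer_k = len(y) gives LOO as outer loop, with a 5-fold inner loop.
acc = nested_cv(X, y, fit_threshold, predict_threshold,
                grid=[0.25, 0.5, 0.75], outer_k=len(y), inner_k=5)
```

The key point is that the held-out fold never influences the choice of hyperparameter, which is what plain LOO with tuning on the same data fails to guarantee.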

Results
Our results are summarized in Table 2. They are generally higher than the 57.2% reported in (Markowitz and Hancock, 2015), with at best 71.7% with nested LOO and a single group of features (unigrams or treelets), and 76.0% when n-grams and treelets are combined.
Using all the n-grams already improves accuracy (+1.3%) over using only unigrams (73.0% accuracy for 1+2-3-grams with N-LOO). On the other hand, combining discourse features with the n-grams brings no improvement over the n-grams alone (72.8% with N-LOO for 1+2-3-grams+Connectives+Explicit Relations lvl1).
The scores obtained with LOO overestimate performance compared to nested cross-validation (see, for example, Figure 1): even if the differences are small, they are consistent across the trials and the feature sets.

Feature analysis
We use Pearson's ρ (with Bonferroni correction) to establish which features are predictive of fraud and non-fraud. The values for the features discussed are reported in Table 3.
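As a small worked illustration of this analysis, the sketch below computes a Pearson correlation between a feature's per-document counts and the binary fraud label, and applies a Bonferroni correction to a list of p-values. The counts, labels and p-values are made-up toy inputs, and the function names are hypothetical; the computation of the p-values themselves is left out.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy data (assumption): counts of one feature in six documents, 1 = fraud.
labels = [1, 1, 1, 0, 0, 0]
counts = [3, 2, 3, 1, 0, 1]
rho = pearson(counts, labels)

# Toy uncorrected p-values for three features tested together.
corrected = bonferroni([0.001, 0.02, 0.5])
```

A positive coefficient marks a feature as indicative of fraud and a negative one as indicative of non-fraud; with many features tested at once, only those whose corrected p-value stays below the chosen threshold would be retained.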
Hedging There is an interesting contrast between adverbial hedges (conceivably, presumably, surely, effectively) and verbal hedges (suggest) indicative of fraud, and adverbial hedges (practically, occasionally) and verbal hedges (assume, speculate) indicative of non-fraud: It seems adverbs and verbs used in fraud are for interpreting the data on behalf of the reader, whereas the adverbs and verbs indicative of non-fraud are more observer-aware (e.g., we speculate). This suggests that a fraud strategy is to hide observer bias, rather than being explicit about it.
Comparison Both the discourse relation and the Inquirer class for comparison are predictive of non-fraud. Scientific fraud thus seems less likely to compare. On the other hand, neither the causal relations nor the presence of causal terms was significantly linked to fraudulent papers.
Therefore vs. since A peculiar but statistically significant difference between fraud and non-fraud articles is that fraud articles prefer therefore over since, while non-fraud articles show the opposite preference. We speculate that it may be a fraud strategy to make the reasoning more verbose by separating out premises (because the authors are, consciously or not, afraid the readers will not accept them). This is in slight contrast with, or qualifies, the main hypothesis in Markowitz and Hancock (2015) that fraudulent writers try to obfuscate their writing.
Other markers of fraud Many technical concepts were highly correlated with fraud, but we suspect these are cases of overfitting. More interestingly, the bigram "described previously" was among the top-5 most highly correlated features, indicating fraud. From our syntactic treelets, proper nouns and interjections were both slightly indicative of fraud (p < 0.01).
Other markers of non-fraud From our syntactic treelets, conjunctions of numbers were indicative of non-fraud, suggesting maybe a higher level of technical detail. Non-fraud articles are also more likely than fraud papers to use the pronoun they rather than we.

Conclusion
We show that a simple unigram model outperforms previous work on scientific fraud detection. Overall, more high-level linguistic features, beyond syntactic treelets, do not lead to improvements, but we also presented a feature analysis showing, for example, that comparison and explanation (at the semantic and discourse level) are indicators of non-fraud, and that fraudulent writing uses slightly different hedging strategies.

Table 3: Pearson ρ and original p-value (before Bonferroni correction) for some features.