Exploring the Intersection of Short Answer Assessment, Authorship Attribution, and Plagiarism Detection

In spite of methodological and conceptual parallels, the computational linguistic fields of short answer scoring (Burrows et al., 2015), authorship attribution (Stamatatos, 2009), and plagiarism detection (Zesch and Gurevych, 2012) have not been linked in practice. This work explores the practical usefulness of combining features from each of these fields for two tasks: short answer assessment and plagiarism detection. The experiments show that incorporating features from the respective other domain yields significant improvements. A feature analysis reveals that robust lexical and semantic features are most informative for these tasks.


Introduction
Despite different ultimate goals, Short Answer Assessment, Plagiarism Detection, and Authorship Attribution are three domains of Computational Linguistics that share a range of methodology. However, these parallels have not been compared across domains. This work explores the intersection of these areas in a practical context. In authorship attribution, a set of texts and potential authors is given, and the goal is to "distinguish between texts written by different authors" (Stamatatos, 2009, page 1). In short answer assessment, tools are designed to assess the meaning of a short answer by comparing it to a reference answer (Burrows et al., 2015), and thereby its semantic appropriateness. In plagiarism detection, two main goals can be pursued (Clough and Stevenson, 2011): in extrinsic plagiarism detection, a source text and potentially plagiarized texts are compared as whole units with methods from the domain of authorship attribution (Grieve, 2007); intrinsic plagiarism detection aims to detect stylistic changes within one document (Zu Eissen and Stein, 2006). All three areas feed textual similarity features on various levels of linguistic abstraction into nominal classifiers, but the distribution of features over three related dimensions differs (Zesch and Gurevych, 2012): style, content, and structure. While (learner language) short answer assessment systems put emphasis on content and ignore stylistic aspects, authorship attribution focuses on stylistic features. Plagiarism detection systems use content, structural, and stylistic similarity features to classify texts as plagiarizing other documents or not. The main task shared by short answer assessment and plagiarism detection is to evaluate the existence and quality of paraphrases of a source text.
This work explores the effect of features from the fields of authorship attribution and plagiarism detection on short answer assessment, as well as the effect of short answer assessment features on plagiarism detection.

Data
For the experiments in the domain of short answer assessment, the Corpus of Reading Comprehension Exercises in German (CREG) was used.
For the experiments in the domain of plagiarism detection, the Wikipedia Reuse Corpus (Clough and Stevenson, 2011) was selected.
These resources were chosen since they are standard shared evaluation resources in their respective domains (Burrows et al., 2015; Zesch and Gurevych, 2012).

CREG
CREG-1032 is a short answer learner corpus containing student and reference answers to questions about reading comprehension texts. The longitudinal data was collected in two German programs in the United States, at the Ohio State University (OSU) and the University of Kansas (KU). The corpus exhibits a high variability of surface forms and semantic content in the student answers due to the variety of proficiency levels represented. Each student answer was annotated by two independent annotators with a binary diagnosis indicating the semantic correctness of the answer, independent of surface variation such as spelling mistakes or agreement errors. The corpus is balanced with respect to this diagnosis. Table 1 shows the distribution of student answers, target answers, and questions, as described by Meurers et al. (2011b), who also showed that the OSU answers are significantly longer (average answer length of 9.7 tokens for KU versus 15.0 for OSU).

Wikipedia Reuse Corpus
The Wikipedia Reuse Corpus (WRC; Clough and Stevenson, 2011) represents different types of text reuse imitating different plagiarism types: copy and paste, light revision, heavy revision, and non-plagiarism. The plagiarism samples vary in the amount of revision and paraphrasing performed by the participants. Table 1 shows the corpus' data distribution. The texts were not exclusively written by English native speakers and show surface and semantic variation similar to the CREG answers. With an average length of 208 tokens, the answers are nearly 20 times as long as the answers in the CREG corpus, but are still referred to as "short answers" (Clough and Stevenson, 2011, page 1). Since Zesch and Gurevych (2012) showed empirical deficits in the distinction of the text reuse conditions, all plagiarism labels were collapsed into a single category, rendering the task a binary classification parallel to the CREG binary diagnoses. In this setting, the data is unbalanced: the majority class is the plagiarism class with 57 instances, whereas there are only 38 non-plagiarized documents.

Baseline Short Answer Assessment System
The UIMA-based CoMiC system (Meurers et al., 2011a; Meurers et al., 2011b) served as the framework for the experiments. It is an alignment-based short answer assessment system which aligns student answers to reference answers on different levels of linguistic abstraction in order to classify learner answers as (in)correct based on the quantity of the different alignment types. CoMiC proved to be highly effective for both German and English (Burrows et al., 2015).
The CoMiC system follows a three-stage pipeline architecture (Bailey and Meurers, 2008; Meurers et al., 2011a): annotation, alignment, and diagnosis. First, the system enriches the raw answer texts with linguistic annotation. Table 2, from Meurers et al. (2011b), shows the different annotation tasks together with the respective tools.

Lexical Relations: GermaNet (Hamp and Feldweg, 1997)
Similarity Score: PMI-IR (Turney, 2001)
Dependency Parsing: MaltParser (Nivre et al., 2007)

In the second step, a globally optimal alignment configuration is selected by the Traditional Marriage Algorithm (Gale and Shapley, 1962). The system aligns tokens, NP chunks, and dependency triples. Tokens are aligned when they match on the surface, lowercased surface, synonym, semantic type, or lemma level. Only new elements (not given verbatim in the corresponding question) are aligned. In the final step, a range of features (Table 3; Meurers et al., 2011b) is extracted and fed to a machine learning component. In contrast to the original CoMiC system, predictions are made with WEKA's (Hall et al., 2009) memory-based learner instead of the TiMBL memory-based learner (Daelemans et al., 2007). The features denote directionalized quantities of alignments on different linguistic levels ('pct' = 'percentage of').
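As an illustration, the givenness-filtered token alignment and the resulting 'pct' features can be sketched as follows. This is a simplified assumption of the mechanism, not the actual CoMiC implementation: only surface, lowercase, and synonym matching are shown, and all function names are hypothetical.

```python
def align_tokens(student, target, question, synonyms=None):
    """Greedily align student tokens to target tokens on the surface,
    lowercased surface, or synonym level, skipping material that is
    already given verbatim in the question (givenness filter)."""
    synonyms = synonyms or {}
    given = {t.lower() for t in question}
    pairs = []
    used = set()
    for s in student:
        if s.lower() in given:
            continue  # skip tokens already given in the question
        for i, t in enumerate(target):
            if i in used or t.lower() in given:
                continue
            if s == t or s.lower() == t.lower() or t in synonyms.get(s, ()):
                pairs.append((s, t))
                used.add(i)
                break
    return pairs

def pct_features(student, target, question, synonyms=None):
    """Directionalized alignment quantities ('pct' features)."""
    pairs = align_tokens(student, target, question, synonyms)
    return {
        "pct_student_aligned": len(pairs) / len(student),
        "pct_target_aligned": len(pairs) / len(target),
    }
```

For a student answer "Der Hund schläft" against the target "Der Hund ruht" and the question "Was macht der Hund", only the synonym pair (schläft, ruht) is aligned, since "Der" and "Hund" are filtered out as given.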

Extensions of the Baseline System
Stamatatos (2009) provides an extensive overview of approaches and stylometric features used in computerized authorship attribution. The features are divided into four subclasses. Table 4, based on (Stamatatos, 2009, page 3), lists all the features used, their corresponding category (lexical/character/syntactic/semantic), and whether they are applied to one or two documents. If a feature is applicable to one document, it is computed both for the student and for the target side in order to model the relation in this specific dimension of similarity, reflected in the prefix 'Student' or 'Target' in the feature names. Features applied to two documents are computed via cosine similarity between a vector for each answer holding the frequencies of the elements under consideration. The feature 'all features interpolated' is a special overlap feature, for which the frequencies produced by all feature extractors are first concatenated into one vector before the cosine similarity is computed (see Figure 1): the first m entries in the vector are lexical features, followed by n character features, and so on.
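The interpolated feature can be sketched as follows; the two extractors (word unigrams and character trigrams) are illustrative stand-ins for the full feature set of Table 4, and all names are hypothetical.

```python
import math
from collections import Counter

def frequency_vector(text, extractors):
    """Concatenate the frequency outputs of all extractors into one
    sparse vector, keyed by (extractor_name, element)."""
    vec = Counter()
    for name, extract in extractors:
        for element in extract(text):
            vec[(name, element)] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors."""
    keys = set(u) | set(v)
    dot = sum(u[k] * v[k] for k in keys)
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# Illustrative extractors: word unigrams (lexical), character trigrams.
extractors = [
    ("word", lambda t: t.split()),
    ("char3", lambda t: [t[i:i + 3] for i in range(len(t) - 2)]),
]

sim = cosine(frequency_vector("the cat sat", extractors),
             frequency_vector("the cat slept", extractors))
```

Because all extractor outputs share a single vector, one similarity score jointly reflects lexical and character overlap, rather than one score per feature class.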
The SpellCorr feature measures the token overlap between two texts using both spelling-corrected and surface forms. For each token, the system checks whether the token or its lemma appears in a word list. If not, the system searches for the closest Levenshtein match, considering both the other document and the word list. All .arff feature files were generated with the same givenness constraints as the CoMiC baseline features and then exported to WEKA.
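A minimal sketch of this spelling correction step, under the simplifying assumption that no lemma lookup is performed and ties are broken arbitrarily:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def spell_correct(token, word_list, other_doc):
    """Return the token if it is known, otherwise its closest
    Levenshtein match from the other document or the word list."""
    candidates = set(word_list) | set(other_doc)
    if token in candidates:
        return token
    return min(candidates, key=lambda w: levenshtein(token, w))

def spellcorr_overlap(doc_a, doc_b, word_list):
    """Token overlap after spelling correction of the first document."""
    corrected = [spell_correct(t, word_list, doc_b) for t in doc_a]
    return len(set(corrected) & set(doc_b)) / len(set(corrected))
```

With this scheme, a misspelled "hte" is mapped to "the" before overlap is measured, so form errors do not mask semantic agreement.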

Experimental Testing
The following orthogonal hypotheses were tested:
1. The accuracy for the learner language short answer assessment task increases when features from the domain of authorship attribution are added.
2. The accuracy for the plagiarism classification task increases when features from the short answer assessment system are added.

Method
The WEKA lazy IBk memory-based learner with k=5 nearest-neighbor search was run in a 10-fold cross-validation setting. Following Dietterich's recommendations for classifier comparison, significance tests were computed in R to determine whether an improvement over the baseline is statistically significant. Table 5 shows accuracies for the prediction of the semantic equivalence of learner answers and for the prediction of plagiarism. For short answer assessment, the CoMiC features, yielding an accuracy of 84.5% (KU) and 87.1% (OSU), are used as a baseline. For plagiarism detection, the set of all style features (Table 4; 84.2%) was used as the baseline. Table 5 shows that using only the baseline features from the respective other domain already yields significant improvements (92.6% for the WRC, 86.9% for CREG-1032-KU) over the baseline of in-domain features. The combination of both baseline feature sets also yields improvements over the respective baselines. Even though the interpolated similarity feature on its own resulted in a surprisingly high accuracy for CREG-1032-OSU (87.9%), it only works in combination for the WRC corpus, resulting in the highest accuracy of all experiments (95.8%). Lexical features alone result in accuracies comparable to the baseline accuracies for both tasks. The character-based features alone work better for short answer assessment, with even better results when combined with the baseline features. Semantic features have a higher impact for plagiarism detection, although for the CREG-1032-OSU data set these features alone yield nearly the baseline accuracy.
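The evaluation setup can be sketched in pure Python; the simple k-NN classifier and the synthetic two-feature data below are stand-ins for WEKA's IBk and the real feature sets, so the numbers are illustrative only.

```python
import random
from collections import Counter

def knn_predict(train, x, k=5):
    """Majority vote among the k training points closest to x
    (squared Euclidean distance)."""
    ranked = sorted(train, key=lambda fy: sum((a - b) ** 2
                                              for a, b in zip(fy[0], x)))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def cross_validate(data, folds=10, k=5):
    """10-fold cross-validation accuracy of the k-NN classifier."""
    random.Random(0).shuffle(data)
    correct = 0
    for i in range(folds):
        test = data[i::folds]
        train = [d for j, d in enumerate(data) if j % folds != i]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(data)

# Synthetic binary task: the label depends on the sum of two features.
rnd = random.Random(1)
data = []
for _ in range(100):
    f = (rnd.uniform(-1, 1), rnd.uniform(-1, 1))
    data.append((f, int(f[0] + f[1] > 0)))
acc = cross_validate(data)
```

Accuracy is aggregated over all ten held-out folds, matching the setting in which the Table 5 results were obtained.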

Results
Table 5: Results for the binary classification tasks. * denotes a significant improvement (α = 0.1).

Feature Analysis. The information gain of features was computed in WEKA with the InfoGainAttributeEval filter with default parameters. Table 6 shows the ten most informative features for each data set. The most informative features are mostly lexical or character-based and thus content-modeling features; the single most informative feature indicates the proportion of matched tokens when spelling-corrected versions are used. This is not surprising given the high surface variability in the corpora and the design choice in the corpus creation to ignore form errors and focus on semantics.
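WEKA's InfoGainAttributeEval ranks features by their information gain with respect to the class. A minimal sketch for nominal feature values follows (WEKA additionally discretizes continuous attributes first, which is omitted here):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """H(class) minus the value-weighted conditional entropy H(class|feature)."""
    base = entropy(labels)
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        cond += len(subset) / n * entropy(subset)
    return base - cond

# A feature identical to the class has maximal gain; a constant has none.
labels = [0, 0, 1, 1]
perfect = info_gain([0, 0, 1, 1], labels)
useless = info_gain([1, 1, 1, 1], labels)
```

Sorting all features by this score descending yields rankings of the kind shown in Table 6.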
Discussion and Related Work
Grieve (2007) provided an extensive comparison of quantitative authorship attribution methods for extrinsic plagiarism detection. The observation that word- and character-based metrics are most successful for extrinsic plagiarism detection is confirmed by the present study. Clough and Stevenson (2011) tested two methods for classifying the texts in their Wikipedia Reuse Corpus: n-gram overlap and longest common subsequence. They report an accuracy of 80% for predicting all four labels, and an accuracy of 94.7% for the binary classification. The present work outperformed the already very accurate system by Clough and Stevenson (2011) by almost one percentage point with an accuracy of 95.8%. Zesch and Gurevych (2012) used a variety of content, structural, and stylistic features for the plagiarism classification task on the Wikipedia Reuse Corpus. They report an accuracy of 96.8% for the task of binary plagiarism classification. Meurers et al. (2011b) reported an accuracy of 84.6% for both the CREG-1032-KU and the CREG-1032-OSU data set with an early version of the CoMiC-DE system. Hahn and Meurers (2012) report an accuracy of 86.3% for the CREG corpus as a result of using the CoSeC system, which uses abstract semantic representations. Horbach et al. (2013) re-implemented the CoMiC system and tested the effect of considering the reading text instead of pre-defined target answers. In the best case, they reached an accuracy of 84.4% on the CREG corpus. Pado and Kiefer (2015) classified answers in the CREG corpus according to their similarity to a target answer. All answers above a threshold were classified as correct, resulting in an accuracy of 83.7% for CREG-1032. Ziai and Meurers (2014) made use of human-annotated information structural annotations for the CREG-1032-OSU data set. They obtained an accuracy of 90.3% with the CoMiC system. Rudzewitz (2015) augmented the CoMiC system with alignment weighting features measuring the importance of aligned elements with respect to the concrete task and general linguistic properties of the aligned elements, reporting an accuracy of 90.0% for the CREG-1032-OSU corpus. The difference of 1.2% to the present work warrants a combination of both approaches in future work.

Table 6: Ten most informative features for the CREG-1032 and WRC data sets.

Conclusions and Future Work
This article represents pioneering work in linking the three research areas of short answer assessment, authorship attribution, and plagiarism detection. The experiments confirmed the hypothesis formulated in the introduction that these areas share a similar methodology in terms of frameworks, tasks, and features. It was shown that features modeling aspects of content, especially robust character-based features, were most effective for both short answer assessment and plagiarism detection, and that the most informative features for both corpora were surprisingly similar. The experiments also made evident that rather simple features can already yield reasonable results for these tasks. Both research hypotheses formulated in the Experimental Testing section could be confirmed, i.e., the respective null hypotheses could be rejected: features from authorship attribution yielded significant improvements for the task of learner language assessment, and features from learner language assessment yielded significant improvements for the task of plagiarism detection. However, it has to be noted that not all features are strictly task-specific; many are also applicable to other NLP tasks. A comparison with related work showed that the results are comparable to current state-of-the-art approaches, although there is still room for improvement. Future work will therefore explore the usage of more features, more elaborate machine learning algorithms, and automatic feature selection techniques. In addition, more corpora from either domain will be used to obtain a broader evaluation perspective. Stylistic features modeling, for example, stopword patterns, as well as longest common subsequence features, are hypothesized to be beneficial for the task of plagiarism detection, since they model stylistic rather than semantic properties.