Improved Evaluation Framework for Complex Plagiarism Detection

Plagiarism is a major issue in science and education. Complex plagiarism, such as plagiarism of ideas, is hard to detect, which makes it especially important to track the improvement of detection methods correctly. In this paper, we study the performance of plagdet, the main measure for plagiarism detection, on manually paraphrased datasets (such as PAN Summary). We reveal its fallibility under certain conditions and propose an evaluation framework with normalization of inner terms, which is resilient to dataset imbalance. We conclude with an experimental justification of the proposed measure. The implementation of the new framework is made publicly available as a GitHub repository.


Introduction
Plagiarism is a problem of primary concern among publishers, scientists, and teachers (Maurer et al., 2006). It involves not only text copying with minor revisions but also the borrowing of ideas. Plagiarism appears in substantially paraphrased forms and represents both conscious and unconscious appropriation of others' thoughts (Gingerich and Sullivan, 2013). This kind of borrowing has very serious consequences and cannot be detected with common Plagiarism Detection Systems (PDS). That is why the detection of complex plagiarism cases comes to the fore and becomes a central challenge in the field.

Plagiarism Detection
Most of the contributions to plagiarism text alignment were made during the annual PAN track for plagiarism detection, held from 2009 to 2015. The latest winning approach (Sanchez-Perez et al., 2014) achieved good performance on all plagiarism types except the Summary part. Moreover, this type of plagiarism turned out to be the hardest for all the competitors.
In a brief review, Kraus (2016) emphasized that the main weakness of modern PDS is their imprecision on manually paraphrased plagiarism and, as a consequence, their weak ability to deal with real-world problems. Thus, the detection of manually paraphrased plagiarism cases is the focus of recently proposed methods for plagiarism text alignment. In the most successful contributions, scientists applied genetic algorithms (Sanchez-Perez et al., 2018; Vani and Gupta, 2017), topic modeling methods (Le et al., 2016), and word embedding models (Brlek et al., 2016) to manually paraphrased plagiarism text alignment. In all of these works, the authors used the PAN Summary datasets to develop and evaluate their methods.

Text Alignment
In this work we deal with an extrinsic text alignment problem. Thus, we are given pairs of suspicious documents and source candidates and try to detect all contiguous passages of borrowed information. For a review of plagiarism detection tasks, see Alzahrani et al. 2012.

Datasets
The PAN corpora for plagiarism text alignment are the main resource for PDS evaluation. This collection consists of slightly or substantially different datasets used at the PAN competitions from 2009 to 2015. We used the most recent 2013 (Potthast et al., 2013) and 2014 (Potthast et al., 2014) English datasets to develop and evaluate our models and metrics. They contain copy&paste, random, translation, and summary plagiarism types. We consider only the last part, as it exhibits the problems of the plagdet framework to the greatest extent.

Evaluation Metrics
The standard evaluation framework for the text alignment task is plagdet (Potthast et al., 2010), which consists of macro- and micro-averaged precision, recall, granularity, and the overall plagdet score. In this work, we consider only the macro-averaged metrics, where recall is defined as

rec_macro(S, R) = (1/|S|) · Σ_{s ∈ S} rec_single_macro(s, R_s),   (1)

and precision is defined through recall as

prec_macro(S, R) = rec_macro(R, S),   (2)

where S and R are the sets of true plagiarism cases and the system's detections, respectively.
Single case recall rec_single_macro(s, R_s) is defined as

rec_single_macro(s, R_s) = (|s_plg ∩ (R_s)_plg| + |s_src ∩ (R_s)_src|) / (|s_plg| + |s_src|),

where R_s is the union of all detections of a given case s, and s_plg, s_src denote the plagiarism and source parts of the case.
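As a sketch, single case recall can be computed over character-offset sets (the set-based representation and the function name here are illustrative, not the reference implementation):

```python
def single_case_recall(s_plg, s_src, r_plg, r_src):
    """Single case macro recall: character overlap between the two parts
    (plg, src) of a true case s and the union R_s of its detections."""
    numerator = len(s_plg & r_plg) + len(s_src & r_src)
    denominator = len(s_plg) + len(s_src)
    return numerator / denominator

# A balanced toy case: half of each part is detected.
s_plg, s_src = set(range(0, 100)), set(range(0, 200))
r_plg, r_src = set(range(50, 150)), set(range(100, 300))
print(single_case_recall(s_plg, s_src, r_plg, r_src))  # → 0.5
```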

Problem Statement
In this section, we explore problems common to several manual plagiarism datasets (mainly, the Summary part of the PAN corpora) and show that the plagdet framework can fail to correctly estimate PDS quality on these datasets.

Dataset Imbalance
The PAN Summary datasets turn out to be highly imbalanced.
• The source part of each plagiarism case takes up the whole source document:

∀s ∈ S ∃d_src ∈ D_src : s_src = d_src.   (3)

Figure 1: Single case recall computation for the text alignment task. Note the imbalance in this case: the plagiarism part s_plg is much shorter than the source part s_src.
• For any given case, its plagiarism part is much shorter than its source part:

∀s ∈ S : |s_plg| ≪ |s_src|.   (4)
As these datasets are publicly available, anyone can figure out these details and, therefore, construct an algorithm for which statements (3) and (4) hold for the detections R as well.
Let us now consider a true case s, its detections R_s, and its source document d_src. Then single case recall for a PAN Summary document equals

rec_single_macro(s, R_s) = (|s_plg ∩ (R_s)_plg| + |d_src|) / (|s_plg| + |d_src|)   (5)

(here we used that s_src ∩ (R_s)_src = d_src, since s_src = (R_s)_src = d_src). Since the plagiarism part s_plg of the case s is much shorter than the source document d_src, the term |d_src| dominates both the numerator and the denominator in eq. (5), which results in inadequately high document-level precision and recall on the PAN Summary datasets.
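The dominance of |d_src| is easy to see numerically; the lengths below are hypothetical, chosen only to mimic statements (3) and (4):

```python
# Character-offset sets for a hypothetical PAN-Summary-like case:
# the source part covers the whole source document (statement 3) and
# the plagiarism part is much shorter (statement 4).
s_plg = set(range(500))        # 500-character summary
d_src = set(range(50_000))     # 50,000-character source document

r_plg = set()                  # the system detects nothing on the plg side...
r_src = d_src                  # ...but trivially reports the whole source

recall = (len(s_plg & r_plg) + len(d_src & r_src)) / (len(s_plg) + len(d_src))
print(round(recall, 2))        # → 0.99: near-perfect recall, no real detection
```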
Other datasets for manual plagiarism detection display similar properties, though not to the extent of PAN Summary. Examples include Palkovskii (2015), Mashhadirajab et al. (2016), and Sochenkov et al. (2017).

Discussion
The important question is whether such dataset imbalance reflects the real-world plagiarizers' behavior.
There is evidence that, when performing a length-unrestricted plagiarism task, people tend to make texts shorter, though not to the PAN Summary extent (Barrón-Cedeño et al., 2013). Moreover, we can find some supporting theoretical reasons. First, summarization and paraphrasing are the only techniques students are taught to use for the transformation of texts; hence, they may use summarization to plagiarize intellectually. Second, in cases of inadvertent plagiarism and plagiarism of ideas, details of source texts are usually omitted or forgotten, which should also lead to shorter plagiarized texts. Although some reasons can be found, such a huge imbalance does not seem sufficiently supported and may be considered a bias.
Let us take a fresh look at the source part of rec_single_macro. We assume that

|s_src ∩ (R_s)_src| / |s_src| ∈ [0, 1],

and this is actually the case if

|s_src ∩ (R_s)_src| ∈ [0, |s_src|].

But, according to Lemma 4.1 (degenerate intersection lemma), for any two contiguous passages e_1, e_2 of a document d,

|e_1 ∩ e_2| ∈ [a, b],   a = max(0, |e_1| + |e_2| − |d|),   b = min(|e_1|, |e_2|).

Intuitively, the lower bound (a) is achieved when e_1 and e_2 are "farthest" away from each other in d, and the upper bound (b) is achieved when e_1 ⊆ e_2 (or e_2 ⊆ e_1). This results in a smaller possible range of intersection lengths and, therefore, of precision and recall values. Because of (3), on PAN Summary this leads to the extreme case of a_src = b_src = |d_src|, which causes precision and recall to take constant values on the source part of the dataset.
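The bounds of the lemma can be sketched over passage lengths alone (the function name is ours):

```python
def intersection_bounds(len_e1, len_e2, len_d):
    """Achievable range [a, b] of |e1 ∩ e2| for two contiguous passages
    e1, e2 of a document d (degenerate intersection lemma)."""
    a = max(0, len_e1 + len_e2 - len_d)  # (a): passages pushed apart in d
    b = min(len_e1, len_e2)              # (b): one passage inside the other
    return a, b

print(intersection_bounds(30, 40, 100))    # → (0, 30): full range available
print(intersection_bounds(60, 70, 100))    # → (30, 60): range shrinks
print(intersection_bounds(100, 100, 100))  # → (100, 100): PAN Summary source side
```

The last call shows the extreme case from statement (3): when both passages cover the whole document, the intersection length is forced to a single constant value.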

Normalized Single Case Recall
To address the issues of dataset imbalance (Section 4.1) and degenerate intersection (Section 4.2), we propose the following normalized version of single case recall nrec_single_macro(s, R_s) for the macro-averaged case:

nrec_single_macro(s, R_s) = (1/2) · Σ_{i ∈ {plg, src}} (|s_i ∩ (R_s)_i| − a_i) / (b_i − a_i),

where

a_i = max(0, |s_i| + |(R_s)_i| − |d_i|),   b_i = min(|s_i|, |(R_s)_i|),   i ∈ {plg, src}.
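A minimal sketch of this normalization, operating on passage lengths only (variable names are ours; treating a degenerate a_i = b_i term as fully recalled is one plausible convention, not necessarily the reference one):

```python
def nrec_single(lengths):
    """Normalized single case recall: each inner term is min-max scaled
    by its achievable range [a_i, b_i] from the degenerate intersection
    lemma. `lengths` maps part -> (|s_i|, |(R_s)_i|, |d_i|, |s_i ∩ (R_s)_i|)."""
    total = 0.0
    for s_i, r_i, d_i, inter in (lengths[p] for p in ("plg", "src")):
        a = max(0, s_i + r_i - d_i)   # lower bound of the intersection
        b = min(s_i, r_i)             # upper bound of the intersection
        # If a == b, the term is constant; count it as fully recalled.
        total += (inter - a) / (b - a) if b > a else 1.0
    return total / 2

# Toy case: half of the achievable range is reached on both parts.
print(nrec_single({"plg": (10, 10, 100, 5), "src": (50, 50, 100, 25)}))  # → 0.5
```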

Normalized recall, precision and plagdet
The result of (1) where every rec_single_macro(s, R_s) term is replaced with nrec_single_macro(s, R_s) is defined as normalized recall nrec_macro(S, R). Normalized precision nprec_macro(S, R) can be obtained from normalized recall using eq. (2). Normalized macro-averaged plagdet, or normplagdet, is defined as follows:

normplagdet(S, R) = F_α / log₂(1 + gran(S, R)),

where F_α is the weighted harmonic mean of nprec_macro(S, R) and nrec_macro(S, R), i.e., the F_α-measure, and gran(S, R) is defined as in Potthast et al. (2010).

Adversarial Models
To justify the proposed evaluation metrics, we construct two models, M1 and M2, which achieve inadequately high macro-averaged precision and recall.

Preprocessing
We represent each plagiarism document d_plg as a sequence of sentences, where each sentence sent_{d_plg,i} ∈ d_plg is a set of tokens. Each source document d_src is represented as a set of its tokens.
For each sentence sent_{d_plg,i}, we also define a measure of similarity sim_{d_plg,d_src,i} with respect to the source document as

sim_{d_plg,d_src,i} = |sent_{d_plg,i} ∩ d_src| / |sent_{d_plg,i}|.

Models
Our models are rule-based classifiers that proceed in three steps for each pair of documents (d_plg, d_src):
1. Form a candidate set according to the similarity score: cand = {i : sim_{d_plg,d_src,i} > 3/4}.
2. Find the candidate with the highest similarity score (if it exists): best = arg max_{i ∈ cand} sim_{d_plg,d_src,i}.
3. (M1) Output sentence best as a detection (if it exists).
(M2) Output sentences {i : i ≠ best} as detections (or all sentences if best does not exist).
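The three steps above can be sketched directly; token sets stand in for the preprocessed sentences, and the function names are illustrative:

```python
def similarity(sentence, src_tokens):
    """Fraction of a suspicious sentence's tokens found in the source document."""
    return len(sentence & src_tokens) / len(sentence)

def detect(sentences, src_tokens, model="M1"):
    """Rule-based adversarial detectors M1 and M2 (sketch)."""
    sims = [similarity(s, src_tokens) for s in sentences]
    cand = [i for i, v in enumerate(sims) if v > 3 / 4]        # step 1
    best = max(cand, key=lambda i: sims[i]) if cand else None  # step 2
    if model == "M1":                                          # step 3
        return [best] if best is not None else []
    # M2: all sentences except best (all of them if best does not exist)
    return [i for i in range(len(sentences)) if i != best]

src = {"the", "method", "uses", "topic", "models"}
sents = [{"the", "method", "uses", "graphs"}, {"topic", "models", "the"}]
print(detect(sents, src, "M1"))  # → [1]
print(detect(sents, src, "M2"))  # → [0]
```

Note that the first sentence scores exactly 3/4 and is excluded, since the candidate threshold is strict.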

Results and Discussion
We evaluated our adversarial models, as well as several state-of-the-art algorithms whose source code was available to us, using plagdet and normplagdet scores on all PAN Summary datasets available to date. In the plagdet score comparison (Table 1), we included additional state-of-the-art results (marked by *), taken from the respective papers. The proposed models M1 and M2, despite their simplicity, outperform all algorithms on the macro-averaged plagdet and recall measures on almost every dataset.
In contrast, when measured with the normplagdet score (Table 2), M1 and M2 exhibit poor results, while the tested state-of-the-art systems consistently achieve better recall and normplagdet scores. These experimental results back up our claim that normplagdet is more resilient to dataset imbalance and degenerate intersection attacks, and show that the tested state-of-the-art algorithms do not exploit these properties of the PAN Summary datasets.
The code for calculating the normplagdet metrics, both macro- and micro-averaged, is made available as a GitHub repository. We preserved the command line interface of the plagdet framework to allow easy adaptation for existing systems.

Conclusion
Our paper shows that the standard evaluation framework with plagdet measure can be misused to achieve high scores on datasets for manual plagiarism detection.
We constructed two primitive models that achieve state-of-the-art results for detecting plagiarism of ideas by exploiting flaws of the standard plagdet. Finally, we proposed a new framework, normplagdet, that normalizes single case scores to prevent misuse of datasets such as PAN Summary, and justified it experimentally. The proposed evaluation framework seems beneficial not only for plagiarism detection but for any text alignment task with imbalanced or degenerate-intersection dataset properties.