Deep Investigation of Cross-Language Plagiarism Detection Methods

This paper is a deep investigation of cross-language plagiarism detection methods on a new recently introduced open dataset, which contains parallel and comparable collections of documents with multiple characteristics (different genres, languages and sizes of texts). We investigate cross-language plagiarism detection methods for 6 language pairs on 2 granularities of text units in order to draw robust conclusions on the best methods while deeply analyzing correlations across document styles and languages.


Introduction
Plagiarism is a very significant problem nowadays, specifically in higher education institutions. In monolingual context, this problem is rather well treated by several recent researches (Potthast et al., 2014). Nevertheless, the expansion of the Internet, which facilitates access to documents throughout the world and to increasingly efficient (freely available) machine translation tools, helps to spread cross-language plagiarism. Crosslanguage plagiarism means plagiarism by translation, i.e. a text has been plagiarized while being translated (manually or automatically). The challenge in detecting this kind of plagiarism is that the suspicious document is no longer in the same language of its source. In this relatively new field of research, no systematic evaluation of the main methods, on several language pairs, for different text granularities and for different text genres, has been proposed yet. This is what we propose in this paper.
Contribution. The paper focus is on crosslanguage semantic textual similarity detection which is the main part (with source retrieval) in cross-language plagiarism detection. The evaluation dataset used (Ferrero et al., 2016) allows us to run a large amount of experiments and analyses. To our knowledge, this is the first time that full potential of such a diverse dataset is used for benchmarking. So, the paper main contribution is a systematic evaluation of cross-language similarity detection methods (using in plagiarism detection) on different languages, sizes and genres of texts through a reproducible evaluation protocol. Robust conclusions are derived on the best methods while deeply analyzing correlations across document styles and languages. Due to space limitations, we only provide a subset of our experiments in the paper while more result tables and correlation analyses are provided as supplementary material on a Web link 1 .
Outline. After presenting the dataset used for our study in section 2, and reviewing the stateof-the-art methods of cross-language plagiarism detection that we evaluate in section 3, we describe the evaluation protocol employed in section 4. Then, section 5.1 presents the correla-tion of the methods across language pairs, while section 5.2 presents a detailed analysis on only English-French pair. Finally, section 6 concludes this work and gives a few perspectives.

Dataset
The reference dataset used during our study is the new dataset 2 recently introduced by Ferrero et al. (2016). The dataset was specially designed for a rigorous evaluation of cross-language textual similarity detection. The different characteristics of the dataset are synthesized in Table 1, while Table 2 presents the number of aligned units by subcorpus and by granularity.
More precisely, the characteristics of the dataset are the following: • it is multilingual: it contains French, English and Spanish texts; • it proposes cross-language alignment information at different granularities: document level, sentence level and chunk level; • it is based on both parallel and comparable corpora (mix of Wikipedia, scientific conference papers, amazon product reviews, Europarl and JRC); • it contains both human and machine translated texts; • it contains different percentages of named entities; • part of it has been obfuscated (to make the cross-language similarity detection more complicated) while the rest remains without noise; • the documents were written and translated by multiple types of authors (from average to professionals); • it covers various fields.

Overview of State-of-the-Art Methods
Textual similarity detection methods are not exactly methods to detect plagiarism. Plagiarism is a statement that someone copied text deliberately without attribution, while these methods only detect textual similarities. There is no way 2 https://github.com/FerreroJeremy/ Cross-Language-Dataset of knowing why texts are similar and thus to assimilate these similarities to plagiarism.
At the moment, there are five classes of approaches for cross-language plagiarism detection. The aim of each method is to estimate if two textual units in different languages express the same message or not. Figure 1 presents a taxonomy of Potthast et al. (2011), enriched by the study of Danilova (2013), of the different cross-language plagiarism detection methods grouped by class of approaches. We only describe below the state-of-the-art methods that we evaluate in the paper, one for each class of approaches (those in bold in the Figure 1).

Cross-Language
Character N-Gram (CL-CnG) is based on Mcnamee and Mayfield (2004) model. We use the CL-C3G Potthast et al. (2011)'s implementation. Only spaces and alphanumeric characters are kept. Any other diacritic or symbol is deleted and the texts are lower-cased. The texts are then segmented into 3-grams (sequences of 3 contiguous characters) and transformed into tf.idf vectors of character 3-grams. The metric used to compare two vectors is the cosine similarity.
Cross-Language Conceptual Thesaurus-based Similarity (CL-CTS) aims to measure the semantic similarity using abstract concepts from words in textual units. We reuse the idea of Pataki (2012) which, for each sentence, build a bag-ofwords by getting all the available translations of each word of the sentence. For that, we use a linked lexical resource called DBNary (Sérasset, 2015). The bag-of-words of a sentence is the merge of the bag-of-words of the words of the sentence. After, we use the Jaccard distance (Jaccard, 1912) with fuzzy matching between two bag-ofwords to measure the similarity between two sentences.
Cross-Language Alignment-based Similarity Analysis (CL-ASA) was introduced for the first time by Barrón-Cedeño et al. (2008) and developed subsequently by Pinto et al. (2009). The model aims to determinate how a textual unit is potentially the translation of another textual unit using bilingual unigram dictionary which contains translations pairs (and their probabilities) extracted from a parallel corpus. Our lexical dictionary is calculated applying the IBM-1 model    Danilova (2013), of different approaches for cross-language similarity detection. (Brown et al., 1993) on the concatenation of TED 4 (Cettolo et al., 2012) and News 5 parallel corpora. We reuse the implementation of Pinto et al. (2009) that proposed a formula that factored the alignment function.
Cross-Language Explicit Semantic Analysis (CL-ESA) is based on the explicit semantic analysis model introduced for the first time by Gabrilovich and Markovitch (2007), which represents the meaning of a document by a vector based on the vocabulary derived from Wikipedia, to find a document within a corpus. It was reused by Potthast et al. (2008) in the context of cross-language document retrieval. Our implementation uses a part of Wikipedia, from which our test data was removed, to build the vector representations of the texts.
Translation + Monolingual Analysis (T+MA) consists in translating suspect plagiarized text back into the same language of source text, in order to operate a monolingual comparison between them. We use the Muhr et al. (2010)'s implementation which consists in replacing each word of one text by its most likely translations in the language of the other text, leading to a bags-of-words. We use DBNary (Sérasset, 2015) to get the translations. The metric used to compare two texts is a monolingual matching based on strict intersection of bags-of-words.
More recently, SemEval-2016 (Agirre et al., 2016) proposed a new subtask on evaluation of cross-lingual semantic textual similarity. Despite the fact that it was the first year that this subtask was attempted, there were 26 submissions from 10 teams. Most of the submissions relied on a machine translation step followed by a monolingual semantic similarity, but 4 teams tried to use learned vector representations (on words or sentences) combined with machine translation confidence (for instance the submission of Lo et al. (2016) or Ataman et al. (2016)). The method that achieved the best performance (Brychcin and Svoboda, 2016) was a supervised system built on a word alignment-based method proposed by Sultan et al. (2015). This very recent method is, however, not evaluated in this paper.

Evaluation Protocol
We apply the same evaluation protocol as in Ferrero et al. (2016)'s paper. We build a distance matrix of size N x M , with M = 1,000 and N = |S| where S is the evaluated sub-corpus. Each textual unit of S is compared to itself (actually, since this is cross-lingual similarity detection, each source language unit is compared to its corresponding unit in the target language) and to M -1 other units randomly selected from S. The same unit may be selected several times. Then, a matching score for each comparison performed is obtained, leading to the distance matrix. Thresholding on the matrix is applied to find the threshold giving the best F 1 score. The F 1 score is the harmonic mean of precision and recall. Precision is defined as the proportion of relevant matches (similar crosslanguage units) retrieved among all the matches retrieved. Recall is the proportion of relevant matches retrieved among all the relevant matches to retrieve. Each method is applied on each subcorpus for chunk and sentence granularities. For each configuration (i.e. a particular method applied on a particular sub-corpus considering a particular granularity), 10 folds are carried out by changing the M selected units.

Investigation of Cross-Language
Similarity Performances Table 3 brings together the performances of all methods on all sub-corpora for each pair of languages at chunk and sentence level. In both sub-tables, at chunk and sentence level, the overall F 1 score over all sub-corpora of one method in one particular language pair is given.

Across Language Pairs
As a preliminary remark, one should note that CL-C3G and CL-ESA lead to the same results for a given language pair (same performance if we reverse source and target languages) due to their symmetrical property. Another remark we can make is that methods are consistent across language pairs: best performing methods are mostly the same, whatever the language pair considered. This is confirmed by the calculation of the Pearson correlation between performances of different pairs of languages, from Table 3 and reported in Table 4. Table 4 represents the Pearson correlations between the different language pairs of the overall results of all methods on all sub-corpora. This result is interesting because some of these methods depend on the availability of lexical resources whose quality is heterogeneous across languages. Despite the variation of the source and target languages, a minimum Pearson correlation of 0.940 for EN→FR vs. FR→ES, and a maximum of 0.998 for EN→FR vs. EN→ES and ES→FR vs. FR→ES at chunk level is observed (see Table 4). For the sentence granularity, it is the same order of magnitude: the maximum Pearson correlation is 0.997 for ES→EN vs. EN→ES and ES→FR vs. FR→ES, and the minimum is 0.913 for EN→ES vs. FR→ES (see Table 4). In average the language pair EN→FR is 0.975 correlated with the other language pairs (0.980 at chunk-level and 0.971 at sentence-level), for instance. This correlation suggests the possibility to tune a method on one language and apply it to another language if needed.  Tables 3 and 4. No matter the source and target languages or the granularity, CL-C3G generally outperforms the other methods. Then CL-ASA, CL-CTS and T+MA are also closely efficient but their behavior depends on the granularity. Generally, CL-ASA is better at the chunk granularity, followed by CL-CTS and T+MA. On the contrary, CL-CTS and T+MA are slightly more effective at sentence granularity. One explanation for this is that T+MA depends on the quality of machine translation, which may have poor performance on isolated chunks, while a short length text unit benefits the CL-CTS and CL-ASA methods because of their formula which   will tend to minimize the number of false positives in this case. Anyway, despite these differences in ranking, the gap in term of performance values is small between these closest methods. For instance, we can see that when CL-CTS is more efficient than CL-C3G (ES→FR column at sentence level in Table 3 and Table 5 (b)), the difference of performance is very small (0.0068). Table 6 shows the Pearson correlations of the results (of all methods on all sub-corpora) by language pair between the chunk and the sentence granularity (correlations calculated from Table 3, between the EN→FR column at chunk level with the EN→FR column at sentence level, and so on).
We can see a strong Pearson correlation of the performances on the language pair between the chunk and the sentence granularity (an average of 0.9, with 0.907 for the EN→FR pair, for instance). This proves that all methods behave along a simi-    and sentence granularity performances (correlations also calculated from Table 3, between the CL-C3G line at chunk level with the CL-C3G line at sentence level, and so on), we notice that some methods exhibit a different behavior at both chunk and sentence granularities: for instance, this is the case for CL-ASA which seems to be really better at chunk level. In conclusion, we can say that the methods presented here may behave slightly differently depending on the text unit considered (chunk or sentence) but they behave practically the same no matter the languages of the compared texts are (as long as enough lexical resources are available for dealing with these languages).

Detailed Analysis for English-French
The previous sub-section has shown a consistent behavior of methods across language pairs (strongly consistent) and granularities (less strongly consistent). For this reason, we now propose a detailed analysis for different sub-corpora, for the English-French language pair -at chunk and sentence level -only. Providing these results for all language pairs and granularities would take too much space. Moreover, we also run those state-of-the-art methods on the dataset of the Spanish-English cross-lingual Semantic Textual Similarity task of SemEval-2016 (Agirre et al., 2016) and SemEval-2017 (Cer et al., 2017), and propose a shallower but equally rigorous analysis. However, all those results are also made available as supplementary material on our paper Web page. Table 8 shows the performances of methods on the EN→FR sub-corpora. As mentioned earlier, CL-C3G is in general the most effective method. CL-ESA seems to show better results on comparable corpora, like Wikipedia. In contrast, CL-ASA obtains better results on parallel corpora such as JRC or Europarl collections. CL-CTS and T+MA are pretty efficient and versatile too. It is also interesting to note that the results of the methods are well correlated between certain types of sub-corpora. For instance, the Pearson correlation of the performances of all methods between the TALN sub-corpus and the APR sub-corpus, is 0.982 at the chunk level, and 0.937 at the sentence level. This means that a method could be optimized on a particular corpus (for instance APR) and applied efficiently on another corpus (for instance TALN which is made of scientific conference papers).
Beyond their capacity to correctly predict a (mis)match, an interesting feature of the methods is their clustering capacity, i.e. their ability to correctly separate the positives (cross-lingual semantic textual similar units) and the negatives (textual units with different meaning) in order to minimize the doubts on the classification. To verify this phenomenon, we conducted another experience with a new protocol. We built a data subset by concatenating some documents of the previously presented dataset (Ferrero et al., 2016). More precisely we used 200 pairs of each sub-corpora at sentence level only. We compared 1000 English textual units to their corresponding unit in French, and to one other (not relevant) French unit. So, each English textual unit must strictly leads to one match and one mismatch, i.e. in the end, we have exactly 1000 matches and 1000 mismatches for a run. We repeat this experiment 10 times for each method, leading to 10 folds for each method.
The results of this experiment are reported on Table 9, that shows the average for the 10 folds of the Precision (P), the Recall (R) and the F 1 score of some state-of-the-art methods, reached at a certain threshold (T). The results are also reported in Figure 2, in the form of distribution histograms of the evaluated methods for 1000 positives and 1000 negatives (mis)matches. X-axis represents the similarity score (in percentage) computed by   Table 9: Precision (P), Recall (R) and F 1 score, reached at a certain threshold (T), of some stateof-the-art methods for a data subset made with 1000 positives and 1000 negatives (mis)matches -10 folds validation.
the method, and Y-axis represents the number of (mis)matches found for a given similarity score. In white, in the upper part of the figures, the positives (units that needed to be matched), and in black, in the lower part, the negatives (units that should not be matched).
Distribution histograms on Figure 2 highlights the fact that each method has its own fingerprint: even if two methods looks equivalent in term of performances (see Table 9), their clustering capacity, and so the distribution of their (mis)matches can be different. For instance, we can see that a random distribution is a very bad distribution (Figure 2 (a)). We can also see that CL-C3G has a narrow distribution of negatives and a broad distribution for positives (Figure 2 (c)), whereas the opposite is true for CL-ASA (Figure 2 (e)). Table 9 confirms this phenomenon by the fact that the decision threshold is very different for CL-ASA (0.762) compared to the other methods (around 0.1). This means that CL-ASA discriminates more correctly the positives that the negatives, when it seems to be the opposite for the other methods. For this reason, we can make the assumption that some methods are complementary, due to their different fingerprint. These behaviors suggest that fusion between these methods (notably decision tree based fusion) should lead to very promising results.

Conclusion
We conducted a deep investigation of crosslanguage plagiarism detection methods on a challenging dataset. Our results have shown a common behavior of methods across different language pairs. We revealed strong correlations across languages but also across text units considered. This means that when a method is more effective than another on a sufficiently large dataset, it is generally more effective in any other case. This also means that if a method is efficient on a particular language pair, it will be similarly efficient on another language pair as long as enough lexical resources are available for these languages.
We also investigated the behavior of the methods through the different types of texts on a particular language pair: English-French. We revealed strong correlations across types of texts. This means that a method could be optimized on a particular corpus and applied efficiently on another corpus.
Finally, we have shown that methods behave differently in clustering match and mismatched units, even if they seem similar in performance. This opens new possibilities for their combination or fusion.
More results supporting these facts are provided as supplementary material 6 .  : Distribution histograms of some state-of-the-art methods for 1000 positives and 1000 negatives (mis)matches. X-axis represents the similarity score (in percentage) computed by the method, and Y-axis represents the number of (mis)matches found for a given similarity score. In white, in the upper part of the figures, the positives (units that needed to be matched), and in black, in the lower part, the negatives (units that should not be matched). Cross-Language-Dataset/tree/master/study