Using Word Embedding for Cross-Language Plagiarism Detection

This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following: (a) we introduce new cross-language similarity detection methods based on distributed representation of words; (b) we combine the different methods proposed to verify their complementarity and finally obtain an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.


Introduction
Plagiarism is a very significant problem nowadays, specifically in higher education institutions. In a monolingual context, this problem is rather well handled by several recent studies (Potthast et al., 2014). Nevertheless, the expansion of the Internet, which facilitates access to documents throughout the world and to increasingly efficient (and freely available) machine translation tools, helps to spread cross-language plagiarism. Cross-language plagiarism means plagiarism by translation, i.e. a text has been plagiarized while being translated (manually or automatically). The challenge in detecting this kind of plagiarism is that the suspicious document is no longer in the same language as its source. We investigate how distributed representations of words can help to propose new cross-lingual similarity measures that are useful for plagiarism detection. We use word embeddings (Mikolov et al., 2013), which have shown promising performance on many kinds of NLP tasks, as shown in Upadhyay et al. (2016), Ammar et al. (2016) and Ghannay et al. (2016), for instance.

Contributions.
The main contributions of this paper are the following:
• we augment some state-of-the-art methods with the use of word embeddings instead of lexical resources;
• we introduce a syntax weighting in distributed representations of sentences, and show its usefulness for textual similarity detection;
• we combine our methods to verify their complementarity and finally obtain an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus (mix of Wikipedia, conference papers, product reviews, Europarl and JRC), while the best method alone hardly reaches an F1 score higher than 50%.

Dataset
The reference dataset used during our study is the new dataset recently introduced by Ferrero et al. (2016). The dataset was specially designed for a rigorous evaluation of cross-language textual similarity detection. More precisely, the characteristics of the dataset are the following:
• it is multilingual: it contains French, English and Spanish texts;
• it proposes cross-language alignment information at different granularities: document level, sentence level and chunk level;
• it is based on both parallel and comparable corpora (mix of Wikipedia, conference papers, product reviews, Europarl and JRC);
• it contains both human and machine translated texts;
• it contains different percentages of named entities;
• part of it has been obfuscated (to make the cross-language similarity detection more complicated) while the rest remains without noise;
• the documents were written and translated by multiple types of authors (from average to professionals) and cover various fields.
In this paper, we only use the French and English sub-corpora.

Overview of State-of-the-Art Methods
Plagiarism is a statement that someone copied text deliberately without attribution, while these methods only detect textual similarities. However, textual similarity detection can be used to detect plagiarism.
The aim of cross-language textual similarity detection is to estimate whether or not two textual units in different languages express the same meaning. We briefly review below the state-of-the-art methods used in this paper; for more details, see Ferrero et al. (2016).
Cross-Language Character N-Gram (CL-CnG) is based on the model of McNamee and Mayfield (2004). We use the Potthast et al. (2011) implementation, which compares two textual units through their character 3-gram vector representations.
Cross-Language Conceptual Thesaurus-based Similarity (CL-CTS) (Pataki, 2012) aims to measure the semantic similarity using abstract concepts from words in textual units. In our implementation, these concepts are given by a linked lexical resource called DBNary (Sérasset, 2015).

Cross-Language Alignment-based Similarity Analysis (CL-ASA) aims to determine how likely a textual unit is to be the translation of another textual unit, using a bilingual unigram dictionary that contains translation pairs (and their probabilities) extracted from a parallel corpus (Barrón-Cedeño et al. (2008), Pinto et al. (2009)).
Cross-Language Explicit Semantic Analysis (CL-ESA) is based on the explicit semantic analysis model (Gabrilovich and Markovitch, 2007), which represents the meaning of a document by a vector based on concepts derived from Wikipedia. It was reused by Potthast et al. (2008) in the context of cross-language document retrieval.

Translation + Monolingual Analysis (T+MA) consists in translating the two units into the same language, in order to perform a monolingual comparison between them (Barrón-Cedeño, 2012). We use the Muhr et al. (2010) approach using DBNary (Sérasset, 2015), followed by monolingual matching based on bags of words.
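Of the methods reviewed above, CL-CnG is the easiest to illustrate in a few lines. The following is a minimal sketch, not the Potthast et al. (2011) implementation: the normalisation step (lowercasing and keeping only alphanumeric characters) and the use of cosine similarity on raw 3-gram counts are assumptions made for the example, and the function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a textual unit.

    Assumed preprocessing: lowercase the text and drop every character
    that is not alphanumeric (spaces and punctuation are removed).
    """
    normalised = "".join(ch for ch in text.lower() if ch.isalnum())
    return Counter(normalised[i:i + n] for i in range(len(normalised) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty unit is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def cl_cng_similarity(unit_x, unit_y, n=3):
    """Compare two units in different languages through their n-gram vectors."""
    return cosine(char_ngrams(unit_x, n), char_ngrams(unit_y, n))
```

Such a measure can only work for language pairs that share lexical material (cognates, named entities, numbers), which is the case for English and French.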

Evaluation Protocol
We apply the same evaluation protocol as Ferrero et al. (2016). We build a distance matrix of size N × M, with M = 1,000 and N = |S|, where S is the evaluated sub-corpus. Each textual unit of S is compared to itself (more precisely, to its corresponding unit in the target language, since this is cross-lingual similarity detection) and to M − 1 other units randomly selected from S. The same unit may be selected several times. A matching score is then obtained for each comparison performed, leading to the distance matrix. Thresholding is applied on the matrix to find the threshold giving the best F1 score. The F1 score is the harmonic mean of precision and recall. Precision is defined as the proportion of relevant matches (similar cross-language units) retrieved among all the matches retrieved. Recall is the proportion of relevant matches retrieved among all the relevant matches to retrieve. Each method is applied on each EN-FR sub-corpus for chunk and sentence granularities. For each configuration (i.e. a particular method applied on a particular sub-corpus considering a particular granularity), 10 folds are carried out by changing the M selected units.
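The thresholding step can be sketched as follows. This is an illustrative reading of the protocol, not the evaluation code used in the paper: it assumes that each comparison yields a similarity score (higher means more similar) together with a gold label, and that the candidate thresholds are simply the observed scores. The function names are hypothetical.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 obtained when every pair scoring >= threshold is declared a match."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)      # true positives
    fp = sum(1 for s, y in pairs if s >= threshold and not y)  # false positives
    fn = sum(1 for s, y in pairs if s < threshold and y)       # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Return (best F1, threshold), trying each observed score as threshold."""
    candidates = sorted(set(scores))
    return max((f1_at_threshold(scores, labels, t), t) for t in candidates)
```

With one row of the matrix flattened into `scores` and `labels`, `best_threshold` returns the F1 score that the protocol reports for that configuration, under the assumptions stated above.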

Proposed Methods
The main idea of word embeddings is that the representation of a word is obtained from its context (the words around it). The words are projected into a continuous space, and those with similar contexts should be close in this multi-dimensional space. The similarity between two word vectors can be measured by cosine similarity. Using word embeddings for plagiarism detection is therefore appealing, since they can be used to calculate the similarity between sentences in the same language or in two different languages (they intrinsically capture synonymy and morphological closeness). We use the MultiVec (Berard et al., 2016) toolkit for computing and managing the continuous representations of the texts. It includes word2vec (Mikolov et al., 2013), paragraph vector (Le and Mikolov, 2014) and bilingual distributed representations (Luong et al., 2015) features. The corpus used to build the vectors is the News Commentary parallel corpus. For training our embeddings, we use the CBOW model with a vector size of 100, a window size of 5, a negative sampling parameter of 5, and an alpha of 0.02.
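To make the notion of closeness in the embedding space concrete, here is a small sketch of cosine similarity and of a nearest-neighbour lookup. It does not use the MultiVec API: the embeddings are plain Python lists, the toy vectors are invented for illustration, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_words(word, embeddings, k=10):
    """Return the k words whose vectors are closest to the vector of `word`."""
    target = embeddings[word]
    scored = [(cosine_similarity(target, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:k]]


# Toy 3-dimensional bilingual vectors, invented for illustration only;
# real models in this paper have 100 dimensions.
toy_embeddings = {
    "house": [0.9, 0.1, 0.0],
    "maison": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.0],
    "voiture": [0.2, 0.8, 0.1],
}
```

In a bilingual space, the neighbours of a word may belong to either language, which is what makes such vectors usable for cross-language comparison.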

Improving Textual Similarity Using Word Embeddings (CL-CTS-WE and CL-WES)
We introduce two new methods. First, we propose to replace the lexical resource used in CL-CTS (i.e. DBNary) with distributed representations of words. We call this new implementation CL-CTS-WE. More precisely, CL-CTS-WE uses the top 10 closest words in the embedding model to build the bag of words (BOW) of a word. Second, we implement a more straightforward method (CL-WES), which performs a direct comparison between two sentences in different languages through the use of word embeddings. It consists of a cosine similarity on distributed representations of the sentences, which are the sums of the embedding vectors of the words of the sentences. Let U be a textual unit; the n words of the unit are represented by u_i as:
$$U = \{u_1, u_2, \ldots, u_n\} \qquad (1)$$
If U_x and U_y are two textual units in two different languages, CL-WES builds their (bilingual) common representation vectors V_x and V_y and applies a cosine similarity between them.
A distributed representation V of a textual unit U is calculated as follows:
$$V = \sum_{i=1}^{n} \mathrm{vector}(u_i) \qquad (2)$$
where u_i is the i-th word of the textual unit and vector is the function which gives the word embedding vector of a word. This feature is available in MultiVec (Berard et al., 2016).
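The CL-WES computation described above can be sketched in a few lines. This is not the MultiVec implementation: the embeddings are a plain dictionary shared by both languages, out-of-vocabulary words are skipped (an assumption, since their handling is not specified here), and all names and toy values are hypothetical.

```python
import math


def unit_vector(words, embeddings, dim):
    """Sum of the embedding vectors of the words of a textual unit."""
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        total = [t + v for t, v in zip(total, vec)]
    return total


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cl_wes(words_x, words_y, embeddings, dim):
    """Cosine similarity between the summed vectors of two textual units."""
    v_x = unit_vector(words_x, embeddings, dim)
    v_y = unit_vector(words_y, embeddings, dim)
    return cosine_similarity(v_x, v_y)


# Toy 2-dimensional bilingual embeddings, invented for illustration only.
toy = {
    "the": [0.1, 0.1], "cat": [0.9, 0.2], "car": [0.1, 0.9],
    "le": [0.1, 0.1], "chat": [0.8, 0.3],
}
```

On these toy vectors, `cl_wes(["the", "cat"], ["le", "chat"], toy, 2)` is higher than `cl_wes(["the", "car"], ["le", "chat"], toy, 2)`, which is the behaviour expected from the method.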

Cross-Language Word Embedding-based Syntax Similarity (CL-WESS)
Our next contribution is an improvement of CL-WES that introduces syntactic information into it.
Let U be a textual unit; the n words of the unit are represented by u_i as expressed in formula (1). First, we syntactically tag U with a part-of-speech tagger (TreeTagger (Schmid, 1994)) and we normalize the tags with the Universal Tagset of Petrov et al. (2012). Then, we assign a weight to each type of tag: this weight will be used to compute the final vector representation of the unit. Finally, we optimize the weights with the help of Condor (Berghen and Bersini, 2005). Condor applies Newton's method with a trust-region algorithm to determine the weights that optimize the F1 score. We use the first two folds of each sub-corpus to determine the optimal weights. The formula of the syntactic aggregation is:
$$V = \sum_{i=1}^{n} \mathrm{weight}(\mathrm{pos}(u_i)) \cdot \mathrm{vector}(u_i) \qquad (3)$$
where u_i is the i-th word of the textual unit, pos is the function which gives the universal part-of-speech tag of a word, weight is the function which gives the weight of a part-of-speech, vector is the function which gives the word embedding vector of a word, and · denotes the multiplication of a vector by a scalar.
If U_x and U_y are two textual units in two different languages, we build their representation vectors V_x and V_y following formula (3) instead of (2), and apply a cosine similarity between them. We call this method CL-WESS and we have implemented it in MultiVec (Berard et al., 2016).
It is important to note that, contrary to what is done in other tasks such as neural parsing (Chen and Manning, 2014), we did not use POS information as an additional vector input, because we considered it more useful to use it to weight the contribution of each word to the sentence representation, according to its morpho-syntactic category.
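The syntactic weighting of CL-WESS can be sketched as follows. This is an illustration under stated assumptions, not the MultiVec implementation: the words are assumed to be already tagged with universal part-of-speech tags, untagged or unknown tags receive a default weight of 1.0 (an assumption), out-of-vocabulary words are skipped, and the weights in the usage example are invented rather than the optimized values obtained with Condor.

```python
import math


def weighted_unit_vector(tagged_words, embeddings, pos_weights, dim):
    """Sum of word vectors, each scaled by the weight of its POS tag."""
    total = [0.0] * dim
    for word, tag in tagged_words:
        vec = embeddings.get(word)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        w = pos_weights.get(tag, 1.0)  # assumption: default weight of 1.0
        total = [t + w * v for t, v in zip(total, vec)]
    return total


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cl_wess(tagged_x, tagged_y, embeddings, pos_weights, dim):
    """Cosine similarity between two syntactically weighted unit vectors."""
    v_x = weighted_unit_vector(tagged_x, embeddings, pos_weights, dim)
    v_y = weighted_unit_vector(tagged_y, embeddings, pos_weights, dim)
    return cosine_similarity(v_x, v_y)
```

For example, with hypothetical weights `{"NOUN": 2.0, "DET": 0.0}`, a determiner no longer contributes to the unit vector while a noun counts double, so the comparison is driven by content words.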

Weighted Fusion
We try to combine our methods to improve cross-language similarity detection performance. During weighted fusion, we assign one weight to the similarity score of each method and we calculate a (weighted) composite score. We optimize the distribution of the weights with Condor (Berghen and Bersini, 2005). We use the first two folds of each sub-corpus to determine the optimal weights, while the other eight folds are used to evaluate the fusion. We also try an average fusion, i.e. a weighted fusion where all the weights are equal.
Regardless of their capacity to predict a (mis)match, an interesting feature of the methods is their clustering capacity, i.e. their ability to correctly separate the positives (similar units) and the negatives (different units) in order to minimize the doubts on the classification. The distribution histograms in Figure 1 highlight the fact that each method has its own fingerprint. Even if two methods look equivalent in terms of final performance, their distributions can be different. One explanation is that the methods do not work in the same way. Some methods are lexical-syntax-based, others work by aligning concepts (more semantic) and still others capture context with word vectors. For instance, CL-C3G has a narrow distribution of negatives and a broad distribution of positives (Figure 1 (a)), whereas the opposite is true for CL-ASA (Figure 1 (b)).
We try to exploit this complementarity using decision-tree-based fusion. We use the C4.5 algorithm (Quinlan, 1993) implemented in Weka 3.8.0 (Hall et al., 2009). The first two folds of each sub-corpus are used to determine the optimal decision tree and the other eight folds to evaluate the fusion (same protocol as weighted fusion). While analyzing the trained decision tree, we see that CL-C3G, CL-WESS and CL-CTS-WE are the closest to the root. This confirms their relevance for similarity detection, as well as their complementarity.
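The weighted and average fusions described earlier in this section amount to a weighted combination of the per-method similarity scores. The sketch below is only illustrative: normalising by the sum of the weights is an assumption, the weights shown in the test values are invented (the paper optimizes them with Condor), and the function names are hypothetical.

```python
def weighted_fusion(scores, weights):
    """Composite score: weighted mean of the per-method similarity scores.

    `scores` and `weights` map a method name to a float. Dividing by the
    total weight (an assumption) keeps the result on the scale of the inputs.
    """
    total_weight = sum(weights[method] for method in scores)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(weights[m] * s for m, s in scores.items()) / total_weight


def average_fusion(scores):
    """Weighted fusion in which all the weights are equal."""
    return weighted_fusion(scores, {method: 1.0 for method in scores})
```

The composite score is then thresholded like the score of a single method. The decision tree fusion differs in that it does not reduce the scores to one number: it learns a sequence of tests on individual method scores.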

Results and Discussion
Use of word embeddings. We can see in Table 1 that the use of distributed representations of words instead of lexical resources improves CL-CTS (CL-CTS-WE obtains an overall performance gain of +3.83% on chunks and +3.19% on sentences). Despite this improvement, CL-CTS-WE still performs worse than CL-C3G. While the use of bilingual sentence vectors (CL-WES) is simple and elegant, its performance is lower than that of three state-of-the-art methods. However, its syntactically weighted version (CL-WESS) looks very promising and boosts the CL-WES overall performance by +11.78% on chunks and +14.92% on sentences. Thanks to this improvement, CL-WESS is significantly better than CL-C3G (+2.97% on chunks and +7.01% on sentences) and is the best single method evaluated so far on our corpus.
Fusion. Results of the decision tree fusion are reported at both chunk and sentence level in Table 1. Weighted and average fusion are only reported at chunk level. In each case, we combine the 8 previously presented methods (the 5 state-of-the-art methods and the 3 new methods). Weighted fusion outperforms the state-of-the-art and the embedding-based methods in all cases. Nevertheless, fusion based on a decision tree is much more effective. At chunk level, decision tree fusion leads to an overall F1 score of 89.15%, while the previous best, weighted fusion, obtains 80.01% and the best single method only obtains 53.73%. The trend is the same at the sentence level, where decision tree fusion largely surpasses any other method (88.50% against 56.35% for the best single method). In our evaluation, the best decision tree, with an overall rate of correct classification above 85% at both levels, involves at a minimum CL-C3G, CL-WESS and CL-CTS-WE. These results confirm that the different methods proposed complement each other, and that embeddings are useful for cross-language textual similarity detection.

Conclusion and Perspectives
We have augmented several baseline approaches using word embeddings. The most promising approach is a cosine similarity on syntactically weighted distributed representations of sentences (CL-WESS), which overall beats the previous best state-of-the-art method. Finally, we have also demonstrated that all methods are complementary and that their fusion significantly improves cross-language textual similarity detection performance. At chunk level, decision tree fusion leads to an overall F1 score of 89.15%, while the previous best, weighted fusion, obtains 80.01% and the best single method only obtains 53.73%. The trend is the same at the sentence level, where decision tree fusion largely surpasses any other method. Our short-term goal is to improve CL-WESS by analyzing the syntactic weights, or even adapting them according to the plagiarist's stylometry. We have also made a submission to SemEval-2017 Task 1, i.e. the task on Semantic Textual Similarity detection.