CompiLIG at SemEval-2017 Task 1: Cross-Language Plagiarism Detection Methods for Semantic Textual Similarity

We present our submitted systems for Semantic Textual Similarity (STS) Track 4 at SemEval-2017. Given a pair of Spanish-English sentences, each system must estimate their semantic similarity with a score between 0 and 5. In our submission, we use syntax-based, dictionary-based, context-based, and MT-based methods. We also combine these methods in unsupervised and supervised ways. Our best run ranked 1st on track 4a with a correlation of 83.02% with human annotations.


Introduction
CompiLIG is a collaboration between Compilatio (www.compilatio.net), a company particularly interested in cross-language plagiarism detection, and the LIG research group on natural language processing (GETALP). Cross-language semantic textual similarity detection is an important step for cross-language plagiarism detection, and evaluation campaigns in this new domain are rare. For the first time, the SemEval STS task (Agirre et al., 2016) was extended with a Spanish-English cross-lingual sub-task in 2016. This year, the sub-task was renewed under track 4 (divided into two sub-corpora: track 4a and track 4b).
Given a sentence in Spanish and a sentence in English, the objective is to compute their semantic textual similarity as a score from 0 to 5, where 0 means no similarity and 5 means full semantic similarity. The evaluation metric is the Pearson correlation coefficient between the submitted scores and the gold standard scores from human annotators. Last year, among 26 submissions from 10 teams, the method that achieved the best performance (Brychcin and Svoboda, 2016) was a supervised system (SVM regression with RBF kernel) based on the word alignment algorithm presented in Sultan et al. (2015).
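The evaluation metric described above can be illustrated with a minimal sketch. The function name `pearson` and the toy score lists are ours and purely illustrative; the official SemEval scoring script is not reproduced here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all without the 1/n factor (it cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical system scores and gold scores on the 0-5 scale.
system_scores = [4.8, 0.5, 3.1, 2.0]
gold_scores = [5.0, 0.0, 3.0, 2.5]
print(round(pearson(system_scores, gold_scores), 4))
```

Because the Pearson coefficient is invariant to a positive linear rescaling of the scores, only the relative ordering and spacing of the system scores matter for this metric.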
Our submission in 2017 is based on cross-language plagiarism detection methods combined with the best performing STS detection method published in 2016. The CompiLIG team participated in SemEval STS for the first time in 2017. The proposed methods are syntax-based, dictionary-based, context-based, and MT-based. They show additive value when combined. The submitted runs consist of (1) our best single unsupervised approach, (2) an unsupervised combination of the best approaches, and (3) a fine-tuned combination of the best approaches. The best of our three runs ranked 1st among all submitted systems on track 4a (51 submissions from 20 teams for this track), with a correlation of 83.02% with human annotations. Correlation results of all participants (including ours) on track 4b were much lower; we try to explain why, and question the validity of track 4b, in the last part of this paper.

Cross-Language Character N-Gram (CL-CnG)
CL-CnG aims to measure the syntactical similarity between two texts. It is based on the work of McNamee and Mayfield (2004) in information retrieval. It compares two texts through their character n-gram vector representations. The main advantage of this kind of method is that it does not require any translation between source and target text.
After some tests on the previous year's dataset to find the best n, we decided to use the CL-C3G implementation of Potthast et al. (2011). Let $S_x$ and $S_y$ be two sentences in two different languages. First, the alphabet of these sentences is normalized to the set $\{a, \ldots, z, 0, \ldots, 9, \text{space}\}$, so only spaces and alphanumeric characters are kept. Any diacritic or other symbol is deleted and the whole text is lower-cased. The texts are then segmented into 3-grams (sequences of 3 contiguous characters) and transformed into tf.idf vectors of character 3-grams. We build our idf model directly on the evaluation data. We use a double normalization K (with K = 0.5) as tf (Manning et al., 2008) and a smoothed inverse document frequency as idf. Finally, a cosine similarity is computed between the vectors of the source and target sentences.
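The CL-C3G pipeline described above can be sketched as follows. This is a minimal illustration, not the implementation of Potthast et al. (2011). Two points are assumptions on our part: diacritics are removed by Unicode decomposition (so "ó" becomes "o"), and the smoothed idf is taken as log(1 + N / df). All function names are illustrative.

```python
import math
import re
import unicodedata
from collections import Counter

K = 0.5  # double-normalization constant given in the text


def normalize(text):
    # Lower-case, split accented letters into base letter + combining mark,
    # then keep only a-z, 0-9 and spaces (this drops marks and symbols).
    decomposed = unicodedata.normalize("NFD", text.lower())
    return re.sub(r"[^a-z0-9 ]", "", decomposed)


def char_ngrams(text, n=3):
    text = normalize(text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_idf(sentences, n=3):
    # Assumed smoothing: idf(t) = log(1 + N / df(t)).
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(char_ngrams(sentence, n)))
    total = len(sentences)
    return {gram: math.log(1 + total / df) for gram, df in doc_freq.items()}


def tfidf_vector(sentence, idf, n=3):
    counts = Counter(char_ngrams(sentence, n))
    if not counts:
        return {}
    max_count = max(counts.values())
    # Double normalization K: tf = K + (1 - K) * f / max_f.
    return {gram: (K + (1 - K) * count / max_count) * idf.get(gram, 0.0)
            for gram, count in counts.items()}


def cosine(u, v):
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cl_c3g(sentence_x, sentence_y, idf):
    return cosine(tfidf_vector(sentence_x, idf), tfidf_vector(sentence_y, idf))


# Toy sentences, not taken from the SemEval data.
corpus = ["El presidente visitó la capital.", "The president visited the capital.",
          "Un perro corre en el parque.", "A dog runs in the park."]
idf = build_idf(corpus)
print(cl_c3g(corpus[0], corpus[1], idf))
```

The method relies on surface overlap only, so it works best for language pairs that share many cognates and named entities, as Spanish and English do.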

Cross-Language Conceptual Thesaurus-based Similarity (CL-CTS)
CL-CTS (Gupta et al., 2012; Pataki, 2012) aims to measure the semantic similarity between two vectors of concepts. The model consists in representing texts as bags of words (or concepts) in order to compare them. This method does not require explicit translation either, since the matching is performed using internal connections in the "ontology" used. Let $S$ be a sentence of length $n$; its $n$ words are represented by $w_i$ as:

$$S = \{w_1, w_2, w_3, \ldots, w_n\} \tag{1}$$

$S_x$ and $S_y$ are two sentences in two different languages. A bag of words $S'$ is built from each sentence $S$ by filtering stop words and by using a function that returns, for a given word, all its possible translations. These translations are jointly given by a linked lexical resource, DBNary (Sérasset, 2015), and by cross-lingual word embeddings. More precisely, we use the top 10 closest words in the embeddings model and all the available translations from DBNary to build the bag of words of a word. We use the MultiVec (Berard et al., 2016) toolkit for computing and managing word embeddings. The corpora used to build the embeddings are the Europarl and Wikipedia sub-corpora, which are part of the dataset of Ferrero et al. (2016). For training our embeddings, we use the CBOW model with a vector size of 100, a window size of 5, a negative sampling parameter of 5, and an alpha of 0.02.
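The construction of the bag of words can be sketched as below. The lexical resource and the embedding model are replaced by plain Python dictionaries, which are hypothetical stand-ins: `dictionary` plays the role of the DBNary translations and `nearest_words` the role of a ranked list of closest cross-lingual neighbours. Neither reflects the actual DBNary or MultiVec interfaces.

```python
def build_bag(sentence_words, stop_words, dictionary, nearest_words, top_k=10):
    """Build the conceptual bag of words of a tokenized sentence.

    dictionary:    word -> iterable of translations (stand-in for DBNary)
    nearest_words: word -> list of closest cross-lingual words, best first
                   (stand-in for a cross-lingual embedding model)
    """
    bag = set()
    for word in sentence_words:
        if word in stop_words:
            continue  # stop words are filtered out
        bag.update(dictionary.get(word, ()))
        bag.update(nearest_words.get(word, [])[:top_k])
    return bag


# Toy resources, invented for the example.
stop_words = {"el", "en"}
dictionary = {"perro": ["dog"], "parque": ["park"]}
nearest_words = {"perro": ["dog", "puppy", "cat"], "parque": ["park", "garden"]}

print(sorted(build_bag(["el", "perro", "en", "parque"],
                       stop_words, dictionary, nearest_words, top_k=2)))
```

Taking the union of both sources trades precision for coverage: the dictionary supplies reliable translations, while the embedding neighbours add related words that the dictionary may lack.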
The sets of words $S'_x$ and $S'_y$ are thus the conceptual representations, in the same language, of $S_x$ and $S_y$ respectively. To calculate the similarity between $S_x$ and $S_y$, we use an augmentation of the Jaccard distance weighted by syntax and frequency, defined as:

$$J(S_x, S_y) = \frac{\Omega(S'_x \cap S'_y)}{\Omega(S'_x \cup S'_y)} \tag{2}$$

where $S_x$ and $S_y$ are the input sentences (also represented as sets of words), and $\Omega$ is the sum of the weights of the words of a set, defined as:

$$\Omega(S) = \sum_{i=1,\, w_i \in S}^{n} \varphi(w_i) \tag{3}$$

where $w_i$ is the $i$-th word of the bag $S$, and $\varphi$ is the weight of a word in the Jaccard distance:

$$\varphi(w) = pos\_weight(w)^{1-\alpha} \,.\, idf(w)^{\alpha} \tag{4}$$

where $pos\_weight$ is the function which gives the weight for each universal part-of-speech tag of a word, $idf$ is the function which gives the inverse document frequency of a word, and $.$ is the scalar product. Equation (4) is a way to weight the contribution of a word to the Jaccard distance both syntactically ($pos\_weight$) and by frequency ($idf$), the two contributions being controlled by the $\alpha$ parameter. We assume that, for each word, we have its part-of-speech within its original sentence and its inverse document frequency. We use TreeTagger (Schmid, 1994) for POS tagging, and we normalize the tags with the Universal Tagset of Petrov et al. (2012). Then, we assign a weight to each of the 12 universal POS tags. The 12 POS weights and the value of $\alpha$ are optimized with Condor (Berghen and Bersini, 2005) in the same way as in Ferrero et al. (2017). Condor applies Newton's method with a trust region algorithm to determine the weights that optimize a desired output score. These hyperparameters were not re-tuned for the SemEval task.
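The weighted Jaccard measure can be sketched as follows, taking the two bags of words as given. The weight function `phi` is passed in as a parameter. The helper `make_phi`, which interpolates geometrically between the POS weight and the idf under the control of alpha, is an assumed form for illustration, and the per-word lookup tables are hypothetical.

```python
def omega(bag, phi):
    """Sum of the weights phi(w) over the words of a bag."""
    return sum(phi(word) for word in bag)


def weighted_jaccard(bag_x, bag_y, phi):
    """Weight of the shared words divided by the weight of all words."""
    denominator = omega(bag_x | bag_y, phi)
    if denominator == 0:
        return 0.0
    return omega(bag_x & bag_y, phi) / denominator


def make_phi(pos_weight, idf, alpha):
    # Assumed form: phi(w) = pos_weight(w)^(1 - alpha) * idf(w)^alpha.
    # Both tables map a word to a positive number; unknown words get 1.0.
    def phi(word):
        return pos_weight.get(word, 1.0) ** (1 - alpha) * idf.get(word, 1.0) ** alpha
    return phi


# Toy weights, invented for the example.
pos_weight = {"dog": 2.0, "park": 2.0, "runs": 1.0}
idf = {"dog": 3.0, "park": 2.5, "runs": 1.5}
phi = make_phi(pos_weight, idf, alpha=0.5)

print(weighted_jaccard({"dog", "park"}, {"dog", "runs"}, phi))
```

With a constant weight function, the measure reduces to the ordinary Jaccard index, which gives a convenient sanity check.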

Cross-Language Word Embedding-based Similarity
CL-WES (Ferrero et al., 2017) consists in a cosine similarity on distributed representations of sentences, which are obtained by the weighted sum of each word vector in a sentence. As in the previous section, each word vector is weighted syntactically and by frequency.
If $S_x$ and $S_y$ are two sentences in two different languages, then CL-WES builds their (bilingual) common representation vectors $V_x$ and $V_y$ and applies a cosine similarity between them. A distributed representation $V$ of a sentence $S$ is calculated as follows:

$$V = \sum_{i=1,\, w_i \in S}^{n} \big(vector(w_i) \,.\, \varphi(w_i)\big) \tag{5}$$

where $w_i$ is the $i$-th word of the sentence $S$, $vector$ is the function which gives the word embedding vector of a word, $\varphi$ is the same as in Equation (4), and $.$ is the scalar product. We make this method publicly available through the MultiVec (Berard et al., 2016) toolkit.
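A minimal sketch of this computation is given below. It assumes that a bilingual embedding table is available as a plain dictionary from words to equal-length lists of floats, and that the word weight `phi` is supplied by the caller. The toy embeddings are invented, and out-of-vocabulary words are simply skipped, which is our choice rather than something stated in the text.

```python
import math


def sentence_vector(words, embeddings, phi):
    """Weighted sum of the embedding vectors of the words of a sentence."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for word in words:
        if word not in embeddings:
            continue  # out-of-vocabulary words are skipped (our choice)
        weight = phi(word)
        for k, value in enumerate(embeddings[word]):
            vec[k] += weight * value
    return vec


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cl_wes(words_x, words_y, embeddings, phi):
    return cosine(sentence_vector(words_x, embeddings, phi),
                  sentence_vector(words_y, embeddings, phi))


# Toy bilingual embeddings in a shared 2-dimensional space.
embeddings = {"perro": [1.0, 0.0], "dog": [0.9, 0.1], "casa": [0.0, 1.0]}
print(cl_wes(["perro"], ["dog"], embeddings, phi=lambda word: 1.0))
```

Since cosine similarity ignores vector length, only the relative weights of the words within a sentence influence the result.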

Translation + Monolingual Word Alignment (T+WA)
The last method used is a two-step process. First, we translate the Spanish sentence into English with Google Translate (i.e., we bring the two sentences into the same language). Then, we align both utterances. We reuse the monolingual aligner of Sultan et al. (2015) with the improvement of Brychcin and Svoboda (2016), who won the cross-lingual sub-task in 2016 (Agirre et al., 2016). Because this improvement has not been released by the initial authors, we propose to share our re-implementation on GitHub. If $S_x$ and $S_y$ are two sentences in the same language, then we try to measure their similarity with the following formula:

$$J(S_x, S_y) = \frac{\omega(A_x) + \omega(A_y)}{\omega(S_x) + \omega(S_y)} \tag{6}$$

where $S_x$ and $S_y$ are the input sentences (represented as sets of words), $A_x$ and $A_y$ are the sets of aligned words for $S_x$ and $S_y$ respectively, and $\omega$ is a frequency weight of a set of words, defined as:

$$\omega(A) = \sum_{w \in A} idf(w) \tag{7}$$

where $idf$ is the function which gives the inverse document frequency of a word.
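The scoring step can be sketched as follows, once translation and alignment have been performed. Neither Google Translate nor the aligner of Sultan et al. (2015) is called here: the aligned word sets are given as inputs. The sketch assumes that the score is the idf-weighted share of aligned words in both sentences, in the spirit of Brychcin and Svoboda (2016). The function names and idf values are illustrative.

```python
def omega(words, idf):
    """Frequency weight of a set of words: the sum of their idf values."""
    return sum(idf.get(word, 0.0) for word in words)


def alignment_similarity(words_x, words_y, aligned_x, aligned_y, idf):
    """idf-weighted proportion of aligned words over both sentences."""
    denominator = omega(words_x, idf) + omega(words_y, idf)
    if denominator == 0:
        return 0.0
    return (omega(aligned_x, idf) + omega(aligned_y, idf)) / denominator


# Toy example: "dog" and "park" are aligned, "runs" and "sleeps" are not.
idf = {"dog": 2.0, "park": 2.0, "runs": 1.0, "sleeps": 1.0}
words_x = {"dog", "runs", "park"}
words_y = {"dog", "sleeps", "park"}
print(alignment_similarity(words_x, words_y,
                           aligned_x={"dog", "park"},
                           aligned_y={"dog", "park"},
                           idf=idf))
```

Weighting by idf means that an unaligned rare word lowers the score more than an unaligned frequent word.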

System Combination
These methods are syntax-, dictionary-, context- and MT-based, and are thus potentially complementary. That is why we also combine them in both unsupervised and supervised fashion. Our unsupervised fusion is an average of the outputs of each method. For supervised fusion, we recast fusion as a regression problem and we experiment with all the methods available in Weka 3.8.0 (Hall et al., 2009). Table 1 reports the results of the proposed systems on the SemEval-2016 STS cross-lingual evaluation dataset. The dataset, the annotation and the evaluation systems were presented in the SemEval-2016 STS task description paper (Agirre et al., 2016), so we do not detail them again here. The lines in bold represent the methods that obtain the best mean score in each category of system (best method alone, unsupervised fusion and supervised fusion). The scores for the supervised systems are obtained with 10-fold cross-validation.

Runs Submitted to SemEval-2017
First, it is important to mention that our outputs are linearly re-scaled to the real-valued interval [0, 5].
Run 1: Best Method Alone. Our first run is based only on the method that performed best alone in our tests (see Table 1), i.e. the Cross-Language Conceptual Thesaurus-based Similarity (CL-CTS) model, as described in section 2.2.
Run 2: Fusion by Average. Our second run is a fusion by average of three methods: CL-C3G, CL-CTS and T+WA, all described in section 2.
Run 3: M5′ Model Tree. Unlike the two previous runs, the third run is a supervised system. We selected the system that obtained the best score during our tests on the SemEval-2016 evaluation dataset (see Table 1), which is the M5′ model tree (Wang and Witten, 1997) (called M5P in Weka 3.8.0 (Hall et al., 2009)). Model trees have a conventional decision tree structure but use linear regression functions at the leaves instead of discrete class labels. The first implementation of model trees, M5, was proposed by Quinlan (1992), and the approach was refined and improved in a system called M5′ by Wang and Witten (1997). To learn the model, we use all the methods described in section 2 as features.
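The unsupervised fusion used in Run 2 and the linear rescaling applied to all outputs can be sketched as below. The sketch assumes that each method returns a score in [0, 1] for every sentence pair; the text does not state the source interval, so it is exposed as a parameter. Function names and scores are illustrative.

```python
def average_fusion(scores_by_method):
    """Average, pair by pair, the scores produced by several methods."""
    columns = list(scores_by_method.values())
    if not columns:
        raise ValueError("at least one method is required")
    if any(len(column) != len(columns[0]) for column in columns):
        raise ValueError("all methods must score the same number of pairs")
    return [sum(values) / len(values) for values in zip(*columns)]


def rescale(scores, source_low=0.0, source_high=1.0, low=0.0, high=5.0):
    """Linearly map scores from [source_low, source_high] to [low, high]."""
    span = source_high - source_low
    return [low + (score - source_low) * (high - low) / span for score in scores]


# Hypothetical scores of three methods on two sentence pairs.
scores = {
    "CL-C3G": [0.20, 0.90],
    "CL-CTS": [0.30, 0.80],
    "T+WA": [0.10, 1.00],
}
print(rescale(average_fusion(scores)))
```

A plain average gives every method the same influence, which is why it needs no training data; the supervised run instead learns how to weight the methods.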

Results of the 2017 Evaluation and Discussion
The dataset, annotation and evaluation systems are presented in the SemEval-2017 STS task description paper (Agirre et al., 2017). We can see in

Table 3: Results of our submitted systems scored on our 120 annotated pairs and on the same 120 SemEval annotated pairs.
To investigate this issue further, we manually annotated 60 random pairs from each sub-corpus (120 annotated pairs among 500). These annotations provide a second annotator reference. We can see in Table 3 that, on the SNLI corpus (4a), our methods behave the same way for both annotations (a difference of about 1.3%). However, on the WMT corpus (4b), the difference in correlation between our annotations and the SemEval gold standard is huge: 30% on average. The Pearson correlation between our annotated pairs and the related gold standard is 85.76% for the SNLI corpus and 29.16% for the WMT corpus. These results question the validity of the WMT corpus (4b) for semantic textual similarity detection.

Conclusion
We described our submission to the SemEval-2017 Semantic Textual Similarity task on track 4 (Sp-En cross-lingual sub-task). Our best results were achieved by an M5′ model tree combination of various textual similarity detection techniques. This approach worked well on the SNLI corpus (4a: it finishes 1st with more than 83% correlation with human annotations), which corresponds to a real cross-language plagiarism detection scenario. We also questioned the validity of the WMT corpus (4b) by providing our own manual annotations and showing that they correlate poorly with those of SemEval.