Non-Literal Text Reuse in Historical Texts: An Approach to Identify Reuse Transformations and its Application to Bible Reuse

Text reuse refers to citing, copying, or alluding to text excerpts from a source in a new context. While detecting reuse in contemporary languages is well supported, given extensive research, techniques, and corpora, automatically detecting historical text reuse is much more difficult. Corpora of historical languages are less documented and often encompass various genres, linguistic varieties, and topics. In fact, historical text reuse detection is much less understood, and empirical studies are necessary to enable and improve its automation. We present a linguistic analysis of text reuse in two ancient datasets. We contribute an automated approach to analyze how an original text was transformed into its reuse, taking linguistic resources into account to understand how they help characterize the transformation. It is complemented by a manual analysis of a subset of the reuse. Our results show the limitations of approaches that focus on literal reuse detection. Yet, linguistic resources can effectively support understanding the non-literal text reuse transformation process. Our results support practitioners and researchers working on understanding and detecting historical reuse.


Introduction
The computational detection of historical text reuse, including citations, quotations, or allusions, can be applied in many respects. It can help trace down historical content (a.k.a. lines of transmission), which is essential to the field of textual criticism (Büchler et al., 2012). In the context of massive digitization projects, it can identify relationships between text excerpts referring to the same source. Specifically, detecting copies of the same historical text that have diverged over time (manuscript studies, a.k.a. Stemma Codicum) is an important task.
Although much work exists in the field of natural language processing (NLP), many new challenges arise when processing historical text. The most important challenges are the absence of supporting tools and methods, including an agreement on a common orthography, standardization of variants, and a wide range of clean, digitized text (Piotrowski, 2012; Geyken and Gloning, 2014; Zitouni, 2014). Typical statistical approaches from the field of NLP are difficult to apply to historically transmitted texts, since these often cover a large timespan and thus comprise many different writing styles, text variants, or even reuse styles (Büchler, 2013). Our long-term goal is to conceive robust text reuse detection techniques for historical texts. To this end, we need to improve the quantitative empirical understanding of such reuse, accompanied by qualitative empirical studies. However, only a few such works exist.
We study less- and non-literal text reuse of Bible verses in Ancient Greek and Latin texts. Our focus is on understanding how the reuse instances are transformed from the original verses. We identify operations that characterize how words are changed, e.g., synonymized, capitalized, or changed in their part-of-speech (PoS) information. Since our approach uses external linguistic resources, including Ancient Greek WordNet (AGWN) (Bizzoni et al., 2014; Minozzi, 2009) and various lemma lists, we also show how such resources can help detect reuse and where their limitations are. We complement the automated approach with a qualitative manual analysis. We contribute:
• an automated approach to characterize how text is transformed between reuse and original,
• an application of the approach to two text datasets where reuse was manually identified,
• empirical data based on the automated approach, complemented by a manual identification.
Our resulting datasets 1 with rich information about the reuse transformation (e.g., PoS and morphology changes, and words becoming synonyms or hyperonyms, among others) can be used as a benchmark for future reuse detection and classification approaches.

Related Work
We first discuss why existing reuse detection approaches are not applicable to historical texts, and then present works trying to address this problem.
Historical Text Reuse and Plagiarism Detection. Büchler (2013) combines state-of-the-art NLP techniques to address reuse detection scenarios for historical texts, ranging from near copies to text excerpts with a minimal overlap. He uses the commonly used fingerprinting method, which selects n-grams from an upfront pre-segmented corpus. While his approach can discover historical and modern text reuse language-independently, it requires a minimum text similarity, typically at least two common features.
Recognizing modified reuse is difficult in general. Alzahrani et al. (2012) study plagiarism detection techniques: n-gram-, syntax-, and semantics-based approaches. As soon as reused text is slightly modified (e.g., words changed), most systems fail. Barrón-Cedeño et al. (2013) conduct experiments on paraphrasing, observing that complex paraphrasing along with a high paraphrasing density challenges plagiarism detection, and that lexical substitution is the most frequent technique for plagiarizing. The AraPlagDet (Bensalem et al., 2015) initiative focuses on the evaluation of plagiarism detection methods for Arabic texts. Eight methods were submitted; they turned out to work with high accuracy on external plagiarism detection but did not achieve usable results for intrinsic plagiarism detection.
Corpora. Huge parallel corpora of modern languages are used in fields such as paraphrase generation and detection, typically to train statistical models (Zhao et al., 2009; Madnani and Dorr, 2010). However, such corpora hardly exist for historical languages, or they are copyrighted, such as the TLG digital library (Pantelia, 2014). Especially in the field of modern reuse investigation, aligned corpora are often used, providing a rich source of paraphrasal sentence pairs in one, sometimes multiple, languages. One such corpus is the Microsoft Research Paraphrase Corpus (MSRP), which contains 5801 manually evaluated paraphrasal sentence pairs in English (Dolan and Brockett, 2005). Other work builds on digital collections such as the Perseus Digital Library (Crane, 1985). For example, Bamman (2008) presents the discovery of textual allusions in a collection of Classical poetry, using measures such as token similarity, n-grams, or syntactic similarity. This allows finding at least the most similar candidates within a closed library. Some works have focused on text reuse in Biblical Greek text. Lee (2007) investigates reuse among the Gospels of the New Testament, aiming at aligning similar sentences.
1 https://bitbucket.org/mariamoritz/emnlp
Using source alternation patterns, among others, the approach relies on cosine similarity, source verse proximity, and source verse order. Focusing on high recall, Büchler et al. (2012) investigated the detection of Homeric quotations in Athenaeus' Deipnosophistai, searching for distinctive words within reuse.
While the approaches above rely on string or feature similarity, Bamman (2011b) attempts to process the semantic space using word-sense disambiguation (Patwardhan et al., 2003; Agirre and Edmonds, 2007). Using a bilingual sense inventory and training set, they classify up to 72 % of word senses correctly.
Utilizing Linguistic Resources. Word nets support identifying word relationships. Jing (1998) investigates issues that come with using WordNet (Miller et al., 1990) for language generation. Among others, these comprise issues arising from the adaptation of a general lexicon to a specific domain. These were addressed by using a domain corpus and an ontology to prune WordNet to a certain domain.
In our work, we are interested in using linguistic resources (word nets and lemma lists) together with PoS information to model the transformation process of reuse, specifically on an ancient language text to find limitations when applied to non-literal text reuse.

Methodology
Our study addresses two main research questions:
RQ1. What is the extent of non-literal reuse in our datasets? This analysis provides a baseline for the following characterizations of the non-literal reuse.
RQ2. How is the non-literally reused text modified in our datasets? We study kinds and frequencies of semantic, lexical, and morphological changes. We develop an automated approach to identify the reuse transformation, and complement it with a manual, qualitative analysis. We formulate two sub-questions:
RQ2.1. How can linguistic resources support the discovery of non-literal reuse? We conjecture that non-literal reuse is difficult to capture automatically (especially due to domain- or author-specific words), but that taking linguistic resources into account helps. We analyze the coverage of words in lemma lists and a synset database, and investigate how useful they are for understanding the reuse transformations.
RQ2.2. What are the limitations of an automated classification approach relying on linguistic resources? Our manual analysis investigates the reuse in its full richness, to understand the limitations of the automated approach and identify further characteristics of the reuse in our datasets.

Study Design
Our study comprises the following main steps. First (RQ1), we identify and characterize the literal and non-literal overlap in reuse instances. Second (towards RQ2), we define operations reflecting literal reuse, replacements (inspired by semantic relationships, such as synonyms and hyperonyms, supported by AGWN), and morphological changes (e.g., when the mapped words contain the same cognate). Our operations are based on one-word replacements to better quantify the results. Third (RQ2.1), we develop an algorithm that identifies operations by first looking for morphological changes between a word from the reuse and its corresponding candidate from the Bible verse and, if unsuccessful, by seeking a semantic relation. We apply it to our two datasets and investigate the relationships of affected words and the literal share. We quantify occurrences of operations and calculate two measures, sup_lem (lemma support) and sup_AGWN (AGWN support), to assess the resources' coverage for our approach. Fourth (RQ2.2), we manually analyze a smaller sample of our reuse datasets, using further operations, to understand the full richness of the reuse.

Datasets
We use the following two text sources, both reusing content from Bible verses. As a ground truth of the reuse, we use manually annotated versions of both, provided to us by Mellerin (2014).
Our first dataset comes from the primary source text of "Salvation for the Rich" by the Ancient Greek writer Clement of Alexandria (Clément d'Alexandrie, 2011), a well-known author in Biblical literature (Cosaert, 2008). The Biblindex team annotated 128 text passages as Bible reuse instances, adding a footnote with Bible verse pointers to each. We select a total of 95 out of these 128, following four criteria: (i) the reuse should not consist of an exact literal copy of a Bible verse (skipping six instances), (ii) the reuse should be recognizable by our expert (skipping ten instances), (iii) the reference frame should be within five Bible verses (comparable with sentences), to avoid too much noise in our data and to ensure a comparable length to the original Bible verse (skipping nine instances), and (iv) reuse instances should not exceed a length of 40 tokens (1-2 sentences), again to cut the long tail and avoid too much noise (skipping eight instances). Sometimes one reuse instance pointed to different Bible verses, or one text passage contained more than one reuse instance; thus, we arrive at 199 verse-reuse pairs. The excerpts point to a total of 15 Bible books.
Our second dataset comprises extracts from a total of 14 volumes of twelve works and two work collections by the Latin author Bernard of Clairvaux.
Fig. 1 shows reuse examples, illustrating the wide range of literalness in our data, comprising literal (all tokens overlap), less literal (important tokens overlap), and non-literal (no content word tokens overlap) reuse. For example, Clement's reuse ranges from introducing the overall topic by citing multiple verses, to supporting his argumentation. Specifically, Mk 10, 30 is a fully literal reuse from a passage that discusses the problem of rich men in heaven. Clement uses this episode as a main point in his essay. Later, referring to 1 Cor 13, 13, he again refers to how hard it would be for rich men to enter heaven, explaining that salvation is independent of "external things" but depends on the "virtue of the soul," mentioning faith, hope, and love, the key words in the original verse.
Algorithm 1: Reuse classification algorithm. Executed for each reuse instance and its corresponding Bible verse. morph(x) returns the part-of-speech and/or case of x; repl_case and repl_pos are masked to repl_morph for clarity reasons; checkm(x,y) returns NOPmorph(morph(x), morph(y)) if morph(x) equals morph(y), and repl_morph(morph(x), morph(y)) otherwise.
Input: L, the set of word-lemma pairs obtained from the lemma resources; S, the set of synsets from AGWN, where each synset contains an id and a parent id; T, the list of words of the reuse instance (containing part-of-speech information); B, the list of words of the Bible verse (containing part-of-speech information).
Output: OP, a list of sets containing up to 3 parameterized operations. (s1, s2 denote any two synsets in S; tmp_op is a temporary variable representing the absence of a relation, but not of a lemma.)

PoS Tagging
The automated and the manual approach also take PoS information into account to understand the reuse transformation. Following the Greek morphology of Perseus' tag set, we assign PoS classes to all tokens. We also assign cases for the classes noun, article, adjective, and pronoun. We introduce b to represent the Latin ablative case, which does not exist in Greek.

Automated Approach
Our approach is to model the transformation process in terms of parameterized operations applied to the words in the reuse instance in order to obtain the original words.
Table 1 shows the coverage of each resource for our datasets. In its lower part, we merge all lemma resources into one set of word-lemma pairs. The table shows that CLTK covers the Bible data better than the Hellenistic Greek used by Clement of Alexandria, an author from the 2nd century AD, writing in an archaic style with Biblical vocabulary while also being influenced by Classical Greek. We also check the coverage of lemmata stemming from the same source (Biblindex) as our reuse. To increase the coverage for Greek, we consult SBLGNT&LXX, which in fact increases it.4 To not miss important information, we integrate all of the resources' data into our approach. For every lemma of a word, we check the semantic relations in AGWN. We experimented with different ways of looking up lemmas and found that lower-casing all Latin tokens improved the success. For Greek, it had the opposite effect, which indicates that the Greek text contains more entities that are not available in lower case in the lemma lists, so we kept the original casing there.5
Operations and Classification. We define replacement operations using words and PoS as parameters, to transform a reuse instance back into the Bible verse it originates from. Table 2 lists the operations for the computational approach. We introduce the operations NOPmorph, repl_pos, and repl_case for words having the same cognate, lemma_missing(reuse_word) when a word is not known to any of our lemma resources, and no_rel_found(reuse_word, orig_word) when the relationship between a reuse word and each potential word from the original is not covered by AGWN.
Algorithm 1 shows our approach to classify the reuse transformation by identifying the operations. For each reuse token, we identify the first applicable operation matching the foremost Bible verse word (iterating the verse) in the following order: exact word match (NOP: no operation), then case changed to upper or lower. Thereafter, we look up the lemma and return lem if the lemma of the reused word matches the lemma of the original. For these four, we also check the morphology, additionally returning whether the original has the same PoS and case (NOPmorph), or whether the PoS changed (repl_pos), the case changed (repl_case), or both. So up to three operations can be returned per word. Finally, we check for synonyms (repl_syn), hyperonyms (repl_hyper), hyponyms (repl_hypo), and co-hyponyms (repl_co-hypo), but do not check morphology. Once a Bible verse word is used as a match, it is not used again for any other word from the reuse.
4 CATSS LXX is prepared by the Thesaurus Linguae Graecae project directed by T. Brunner at UC Irvine, with further verification and adaptation by CATSS towards conformity with the individual Göttingen editions that have appeared since 1935. LXXM is the morphologically analyzed text of CATSS LXX, prepared by CATSS under the lead of R. Kraft (Philadelphia team).
5 Often, the decision on whether to represent a word in upper or lower case letters is made by the editor; thus, our decision is affected by the edition we use for our research.
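For illustration, the following Python sketch mirrors this lookup order (exact match, casing, lemma, semantic relations). It is a simplification of Algorithm 1: the data structures (a word-to-lemma dictionary and a lemma-to-relations dictionary) are illustrative assumptions, and the morphology checks (NOPmorph, repl_pos, repl_case) are omitted for brevity.

```python
# Illustrative sketch only: data structures (a word->lemma dict and a
# lemma->relations dict) are assumptions, and the morphology checks
# (NOPmorph, repl_pos, repl_case) of Algorithm 1 are omitted for brevity.

def lemma(word, lemmas):
    """Look up a word's lemma; None if no lemma resource covers it."""
    return lemmas.get(word.lower()) or lemmas.get(word)

def classify_token(reuse_word, verse_words, lemmas, synsets):
    """Return the first applicable operation for one reuse token.

    verse_words: remaining (not yet matched) words of the Bible verse.
    lemmas:  merged word->lemma pairs from all lemma resources.
    synsets: lemma -> {relation: set of related lemmas} from AGWN.
    """
    # 1) literal matches: exact word, then case changes
    for orig in verse_words:
        if reuse_word == orig:
            return ("NOP", orig)
        if reuse_word.lower() == orig:
            return ("lower", orig)
        if reuse_word == orig.lower():
            return ("upper", orig)
    # 2) lemma lookup
    r_lem = lemma(reuse_word, lemmas)
    if r_lem is None:
        return ("lemma_missing", None)
    for orig in verse_words:
        if lemma(orig, lemmas) == r_lem:
            return ("lem", orig)
    # 3) semantic relations, in fixed order
    for relation in ("syn", "hyper", "hypo", "co-hypo"):
        related = synsets.get(r_lem, {}).get(relation, set())
        for orig in verse_words:
            o_lem = lemma(orig, lemmas)
            if o_lem is not None and o_lem in related:
                return ("repl_" + relation, orig)
    return ("no_rel_found", None)
```

In the full algorithm, a matched verse word would additionally be removed from verse_words so it cannot be matched again.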

Qualitative Approach
To obtain a deeper understanding of the limitations of linguistic resources for our purpose, two graduate students (one Latinist, one Classical Archaeologist) manually analyze 100 Greek and 60 Latin reuse instances with their expert knowledge, using an extended set of operations. It comprises ins(word) (insert a word) and del(word) (delete a word), two operations we ignore in the automated approach, where we focus on the coverage of the resources. It also has a richer set of replacement operations: those from the upper part of Table 2 (without upper and lower); moreover, instead of only using repl_case when a cognate stays the same, we refine it and assign all changing morphological categories from Perseus' tag set for any "relativeness" between two words (e.g., repl_case a g).

Results
We now present the results for our research questions in Sec. 4.1-4.3, which are summarized and further interpreted in Sec. 4.4.

Literal Share of the Reuse (RQ1)
We obtain a first understanding of the reuse by looking at the percentage of overlapping words between reuse instance and original Bible verse. We measure the longest common substring based on word tokens. Fig. 2 shows the distributions, distinguishing between a lemmatized and non-lemmatized word comparison.
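For illustration, the overlap measure can be sketched as follows: a token-based longest common substring via standard dynamic programming, normalized by the reuse instance's length (this normalization is an illustrative assumption; the lemmatized variant would simply compare lemmata instead of surface tokens).

```python
# Simplified sketch of the overlap measure: longest common substring over
# word tokens (non-lemmatized variant), via standard dynamic programming.

def lcs_tokens(reuse, verse):
    """Length of the longest contiguous token sequence common to both lists."""
    best = 0
    prev = [0] * (len(verse) + 1)
    for i in range(1, len(reuse) + 1):
        cur = [0] * (len(verse) + 1)
        for j in range(1, len(verse) + 1):
            if reuse[i - 1] == verse[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the common run
                best = max(best, cur[j])
        prev = cur
    return best

def overlap_share(reuse_tokens, verse_tokens):
    """Overlap as a share of the reuse instance's length."""
    return lcs_tokens(reuse_tokens, verse_tokens) / len(reuse_tokens)
```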
While lemmatizing words before comparison has only a small impact, we observe differences between the datasets. In our Latin dataset, the overlap is significantly higher than in the Greek dataset (cf. Sec. 3.2): 25 % (upper quartile) of Bernard's reuse instances have 50 % or more tokens overlapping with their original, which is the case for less than 25 % of Clement's Greek data. Still, large overlaps of up to 75 % occur in both datasets.
For a more precise understanding of the literalness, we group operations into literal (NOP, upper, lower, lem), non-literal (repl_syn, repl_hyper, repl_hypo, repl_co-hypo), and unclassified (no_rel_found and lemma_missing). Within each reuse instance, we calculate their relative occurrence using the results of the automated approach (explained shortly). Fig. 3 shows the distribution of these relative occurrences for all reuse instances. It confirms Fig. 2 by showing a higher rate of literalness for Latin compared to Greek. It also shows that the Latin reuse can be better classified by our approach, which takes the lemma lists and AGWN into account.
Table 3 shows the total number of operations identified for the transformation from reuse instances to the Greek and Latin originals. For 987 (45 %) out of 2189 words in the Greek instances and for 893 (67 %) out of 1335 words in the Latin instances, we were able to identify at least one operation, which already indicates to what extent the resources are helpful. Fig. 4 visualizes the distribution of the frequencies (y-axis) of each operation (x-axis) together with the distribution of the operations' positions in the reuse instances (z-axis). The latter is calculated as the relative position p ∈ [0..1] of an operation with respect to the length of the reuse instance. It indicates that most operation types are distributed over the whole reuse length without a particular trend in both datasets.
We only encounter a frequent use of upper at the first position in Latin, which means that Bernard often starts his Biblical references with literal Bible words.
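The literalness grouping above can be sketched as a small helper; the operation names follow Table 2, while the function itself is illustrative.

```python
# Sketch of the literalness grouping; operation names follow Table 2,
# the helper function itself is illustrative.

LITERAL = {"NOP", "upper", "lower", "lem"}
NON_LITERAL = {"repl_syn", "repl_hyper", "repl_hypo", "repl_co-hypo"}
UNCLASSIFIED = {"no_rel_found", "lemma_missing"}

def literalness_profile(ops):
    """Relative occurrence of each group within one reuse instance."""
    counts = {"literal": 0, "non-literal": 0, "unclassified": 0}
    for op in ops:
        if op in LITERAL:
            counts["literal"] += 1
        elif op in NON_LITERAL:
            counts["non-literal"] += 1
        elif op in UNCLASSIFIED:
            counts["unclassified"] += 1
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else counts
```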

Automated Approach (RQ2.1)
After checking the overall coverage of the linguistic resources for all tokens (cf. Sec. 3.4), we now specifically investigate to what extent the resources support identifying the reuse transformation for the non-literal reuse using our approach. We introduce the measures sup_lem and sup_AGWN to calculate how often looking up a lemma or, subsequently, a synset element was successful. This is straightforward based on our operations. Let Occ(o) be the number of occurrences of an operation o, obtained from the automated approach. These values can be interpreted as follows: the lemma resources for genre- and time-specific text work well for less-literal reuse, but the resources for semantic relationships (synset databases) show a lack of support and need further development.
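The two support measures can be computed from the operation counts; the following sketch is one plausible formalization (the exact normalization shown here is a simplifying assumption): sup_lem is the share of attempted lemma lookups that succeed, and sup_AGWN is the share of lemma-covered words for which AGWN yields a semantic relation.

```python
# Illustrative computation of the two support measures; the exact
# normalization here is a simplifying assumption.

SEMANTIC_OPS = {"repl_syn", "repl_hyper", "repl_hypo", "repl_co-hypo"}

def support(occ):
    """occ: operation name -> Occ(o), its number of occurrences."""
    semantic = sum(occ.get(o, 0) for o in SEMANTIC_OPS)
    # lemma lookup was attempted for all words needing a semantic check
    attempted = occ.get("lemma_missing", 0) + occ.get("no_rel_found", 0) + semantic
    sup_lem = 1 - occ.get("lemma_missing", 0) / attempted if attempted else 0.0
    # AGWN lookup was attempted for all lemma-covered words
    covered = occ.get("no_rel_found", 0) + semantic
    sup_agwn = semantic / covered if covered else 0.0
    return sup_lem, sup_agwn
```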

Qualitative Approach (RQ2.2)
We manually identify the transformation operations for 60 reuse instances of the Ancient Greek data and for 100 of the Latin data. Here, NOPs cover 9.3 %, insertions 49.8 %, and deletions 30.5 % in the Greek data. NOPs cover 26.1 %, insertions 49.7 %, and deletions 11.9 % in the Latin data. Table 4 shows the ratios of the various repl operations based on the remaining 10.4 % and 12.2 %. Similar to the automated approach, we observe a strong use of synonyms and other semantic-level operations, and also a certain portion of switching morphological categories, which indicates paraphrasal reuse. In the Greek data, PoS changes cover about 9 %, out of which a participle became a verb (7 times) and vice versa (5 times). In our Latin data, PoS changes represent 15 % of the replacements: often a pronoun changed to a noun (6 times), and a participle became a verb (12 times). Case changes are shown in Table 5. Significantly often, an ablative became an accusative, because changing prepositions often expect different cases, or an accusative was replaced by an ablative or nominative, because the paraphrasal expression changed.
We encounter exceptions that prevent applying the operations. In the Greek data, one word is replaced with its antonym; once, a synonym also changes its PoS. Four times, more than one morphological category changes; twice an auxiliary is deleted, and five times inserted. We find one writing variance (lem), and three times a synonym is replaced by a multi-word expression. In the Latin data, in 16 cases a synonym is replaced and the morphological information changed. Seven times, more than one morphological parameter changes for the same cognate. Eight times, an auxiliary is inserted or deleted, and twice, a writing variance is encountered. A synonym is replaced by more than one word five times. In one case, a reuse is too paraphrasal for any word to match semantic relationships (e.g., "judged calmly" in Bernard vs. "fake friend" in Sal 12, 18).

Table 5: Numbers of case replacements
operation        Greek  Latin
repl_case a b      0      6
repl_case a n      9      4
repl_case b a      0     10
repl_case d a      0      2
repl_case d g      3      0
repl_case d n      5      0
repl_case g a      5      2
repl_case g n      4      2
repl_case n a      7      5
repl_case n d      3      0
repl_case v g      0      2

Summary and Discussion
RQ1. The reuse is significantly non-literal, and merely lemmatizing words does not help discovering it. Our results show that reuse in two substantial historical texts requires techniques beyond simple preprocessing (e.g., stemming or lemmatizing), which explains why plagiarism-detection systems fail when paraphrases are used (Alzahrani et al., 2012).
RQ2.2. Our results show that the automated approach cannot capture the richness of the manual approach. Especially from the exceptions, it is clear that less-literal reuse does not only need information from a word's semantic environment; it also needs to be identified via looser relations, such as co-hyponyms, multi-to-multi-word associations, or implicit meanings, which can be hidden in structural or broader expert knowledge.

Threats to Validity
External Validity. We enhance the external validity of our work by focusing on Bible verses, one of the oldest, most conveyed, and most cited sources of Ancient Greek, offering a vast amount of primary source text and coming with a long history of scholarly study. Clement of Alexandria is known for his retelling of biblical excerpts (Clemens, 1905; Clemens, 1909; Freppel, 1865), providing an interesting base for reuse investigation. The French abbot Bernard of Clairvaux (Smith, 2010) is equally known for his influence on the Cistercian order and his work in biblical studies. Furthermore, the chosen lemma resources are the most extensive ones existing for Ancient Greek and Latin. We chose AGWN since it is freely available, offering one of the largest synset databases for Ancient Greek and Latin.
Internal Validity. A threat is that our ground truth has mistakes, as the PoS tagging was done by one author only and relied on a manual post-correction. The selection criteria in Sec. 3.2 were chosen to ensure quality and comparability. Extreme outliers in the length of the reuse instance or source (multiple Bible verses) are cut off. For Greek, 33 instances are cut off; for Latin, our sample is significantly smaller than the whole population available to us. To check whether the sample has similar characteristics with respect to the literal reuse, we create Fig. 5. It shows the overlap of all 1128 instances of Bernard's extracted reuse, which, compared to Fig. 2 (right), supports the representativeness of our sample. Last, we can only derive replacement operations when a word token was covered by the lemma sources, contained in AGWN, and when a relation between the two words actually exists. Also, our authors' vocabulary can differ in terms of domain knowledge, personal idiolect, and age of the Biblical vocabulary.

Conclusion
We presented a study of historical, and mostly non-literal, text reuse. We automatically and manually characterized the reuse and identified to what extent existing linguistic resources are able to cover non-literal text reuse. Our results show the potential as well as the necessity to develop robust techniques and to extend linguistic resources for analyzing and detecting such reuse. Our results can help enhance paraphrase generation by modeling how small text portions can be rephrased. Considering the effects of syntactic rearrangement of reuse can also support such efforts. A smarter automated approach for deriving an original text excerpt would be learning so-called edit scripts (Kehrer, 2014; Chawathe et al., 1996), which more precisely identify the operations an author performed on a text to transform it into another version. Whether learning edit scripts on such intricate transformations is possible is an open question and valuable future research. Finally, analyzing further languages and datasets will help to complete our findings.