Restoring ancient text using deep learning: a case study on Greek epigraphy

Ancient History relies on disciplines such as Epigraphy, the study of ancient inscribed texts, for evidence of the recorded past. However, these texts, “inscriptions”, are often damaged over the centuries, and illegible parts of the text must be restored by specialists, known as epigraphists. This work presents Pythia, the first ancient text restoration model that recovers missing characters from a damaged text input using deep neural networks. Its architecture is carefully designed to handle long-term context information, and to deal efficiently with missing or corrupted character and word representations. To train it, we wrote a non-trivial pipeline to convert PHI, the largest digital corpus of ancient Greek inscriptions, into machine-actionable text, which we call PHI-ML. On PHI-ML, Pythia’s predictions achieve a 30.1% character error rate, compared to the 57.3% of human epigraphists. Moreover, in 73.5% of cases the ground-truth sequence was among the Top-20 hypotheses of Pythia. These results demonstrate the impact of this assistive method on the field of digital epigraphy, and set the state of the art in ancient text restoration.


Introduction
One of the key sources for Ancient History is the discipline of epigraphy, which delivers firsthand evidence for the thought, society and history of ancient civilisations. Epigraphy is the study of documents, "inscriptions", written on a durable surface (stone, ceramic, metal) by individuals, groups and institutions of the past (Davies and Wilkes, 2012). Only a small minority of surviving inscriptions are fully legible and complete, as many have been damaged over time (Figure 1). An epigraphist must then hypothesise how much text is missing, and what it might originally have been. These hypotheses are called "restorations" (Bodel, 2012). The present work offers a fully automated aid to the epigraphist's restoration task.
Restoring text is a complex and time-consuming task (Woodhead, 1967; Mattingly, 1996). Epigraphists rely on accessing vast repositories of information to find textual and contextual "parallels" (recurring expressions in similar documents). These repositories primarily consist of a researcher's mnemonic repertoire of such parallels, and of digital corpora for performing "string matching" searches (The Packard Humanities Institute, 2005; Clauss, 2012). However, minor differences in the search query can exclude or obfuscate relevant results, making it hard to estimate the true probability distribution of possible restorations. To the best of our knowledge, this is the first work to bypass the constraints of current epigraphic methods by means of a fully automated deep learning model, PYTHIA, which aids the task of ancient text restoration. It is supplemented by PHI-ML, an epigraphic dataset of machine-actionable text. PYTHIA takes as input a sequence of damaged text, and is trained to predict character sequences comprising the hypothesised restorations. It works both at a character- and a word-level, thereby effectively handling incomplete or missing words. PYTHIA can furthermore be used by all disciplines dealing with ancient texts (philology, papyrology, codicology) and applies to any language (ancient or modern). To aid and encourage future research, PYTHIA and PHI-ML have been open-sourced at http://tiny.cc/ancient-text-restoration.

Related work
Natural language processing (NLP) has dealt with tasks akin to text restoration. Indeed, standard count-based n-gram language models (LM) share with epigraphists the "parallel-finding" approach. N-gram models are outperformed by neural language models, which operate at a word-level (Mikolov et al., 2010, 2011), at a subword- or character-level (Sutskever et al., 2011; Mikolov et al., 2012; Botha and Blunsom, 2014), or a combination of both, known as character-aware language models (Miyamoto and Cho, 2016; Kim et al., 2016; Hwang and Sung, 2017). Despite our efforts to include BERT (Devlin et al., 2018) in our evaluation, we found that its excessive resource requirements did not allow for training on a single GPU. Text restoration also shares similarities with machine reading comprehension (Hermann et al., 2015; Kočiskỳ et al., 2018) and cloze deletion tests (Hill et al., 2016; Bajgar et al., 2017; Fedus et al., 2018; Xie et al., 2018; Zhang et al., 2018). Although word-level language modelling captures context information more efficiently than character-level alternatives, damaged inscriptions preserve only limited parts of words, complicating the learning of representations. To overcome this issue, PYTHIA works simultaneously at both a character- and a word-level, thereby capturing long-term dependencies ("context information").

Generating PHI-ML
Due to the availability of digitised epigraphic corpora, PYTHIA has been trained on ancient Greek (henceforth, "AG") inscriptions, written in the ancient Greek language between the 7th century BCE and the 5th century CE. We chose AG epigraphy as a case study for two reasons: a) the variability of contents and context of the AG epigraphic record makes it an excellent challenge for NLP; b) several digital AG textual corpora have recently been created, the largest ones being PHI (The Packard Humanities Institute, 2005; Gawlinski, 2017) for epigraphy, and Perseus (Smith et al., 2000) and First1KGreek (Crane et al., 2014) for ancient literary texts.
When restoring damaged AG inscriptions, the epigraphists' conjectures on the total number of missing characters are guided by grammatical and syntactical considerations, as well as by the reconstructed graphical layout of the inscription. Conjectured missing characters are conventionally marked with hyphens, one hyphen equating to one missing character. Additionally, epigraphists traditionally convert edited texts to lower case and add punctuation and diacritics, which are generally absent from the original inscription. These conventions were also used in PHI.
Because human annotations in PHI were noisy and often syntactically inconsistent (Iversen, 2007), we wrote a pipeline to convert it into machine-actionable text. We first computed the character frequencies and standardised the AG alphabet to include all core characters, including all accentuation (147 characters), numbers, spaces and punctuation marks. Two additional characters were introduced: '-', representing a missing character, and '?', signifying a character to be predicted. Then we wrote regular expressions to replace all AG numerical notations appearing in the texts with 0 (to avoid numerical correlations), strip the remaining punctuation marks, remove the conventional epigraphical symbols surrounding certain characters (the "Leiden Conventions"), and discard notes whose content was not in Greek. We then cleared human comments, fixed the spacing and casing of duplicate punctuation, and filtered the resulting text so as to retain only the characters of the restricted alphabet. Texts with fewer than 100 characters were also discarded. Lastly, we matched the number of missing characters with those conjectured by epigraphists, converting each length value to an equal number of '-' symbols.
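The cleaning steps above can be summarised as a small normalisation pass over each PHI entry. The following is an illustrative sketch, not the released pipeline: the regular expressions, the `ALPHABET` subset and the `clean_inscription` name are our stand-ins (the real alphabet has 147 characters, and AG numerals are not Arabic digits).

```python
import re

# Illustrative subset of the standardised alphabet (plus space, punctuation, '-', '?', '0').
ALPHABET = set("αβγδεζηθικλμνξοπρςστυφχψω .,·;-?0")

def clean_inscription(text, min_len=100):
    """Normalise one PHI entry into machine-actionable text (sketch)."""
    text = text.lower()
    # Replace numerical notations with 0 to avoid numerical correlations
    # (stand-in: the real pipeline matches AG numeral notations, not digits).
    text = re.sub(r"[0-9]+", "0", text)
    # Strip editorial symbols of the Leiden Conventions surrounding characters.
    text = re.sub(r"[\[\]()<>{}]", "", text)
    # Fix spacing: collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Retain only the characters of the restricted alphabet.
    text = "".join(c for c in text if c in ALPHABET)
    # Discard texts with fewer than min_len characters.
    return text if len(text) >= min_len else None
```

A text that survives the filters is returned cleaned; anything shorter than the threshold is dropped, mirroring the 100-character cutoff described above.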
The resulting dataset is named PHI-ML, and consists of more than 3.2 million words (Table 1). The inscriptions whose PHI IDs ended in {3, 4} (every inscription in PHI was assigned a unique identifier when the original corpus was created) were held out and used respectively as test and validation sets.

Pythia

PYTHIA is a sequence-to-sequence model with the attention mechanism of Bahdanau et al. (2014). The encoder takes an inscription text as input, where the symbol '-' denotes the missing characters, and '?' the blanks to be predicted. The input characters are first passed through a lookup table with learnable embedding vectors. Next, the encoded sequence is used as input for the decoder, which is trained to predict the content of the '?' characters, as shown in Figure 2. Attention allows the decoder to "attend" to parts of the input sequence relevant to the current output, thus improving the modelling of long-term dependencies. To further improve performance, we designed PYTHIA's encoder to take an additional input stream of word embeddings, as it is difficult to model the word-level context using only character-level information. Thus, we generated a list of the 100k most frequent words appearing in PHI-ML, and using a separate lookup table we concatenated at each time-step the embedding of each character with the embedding of the word it belongs to. Words that do not appear in the list, or that contain missing characters, were mapped to 'unk', an embedding for unknown words. Figure 2 illustrates PYTHIA processing the phrase μηδέν ἄγαν. Finally, to allow better modelling we used a bidirectional LSTM encoder and refer to this architecture as PYTHIA-BI-WORD. Further details are given in Appendix A.
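The dual character/word input stream can be illustrated with a small encoding helper: each character id is paired with the id of the word it belongs to, and words that are out-of-vocabulary or contain missing characters fall back to 'unk'. The helper and vocabularies below are illustrative, not part of the released code.

```python
def char_word_ids(text, char_vocab, word_vocab, unk="unk"):
    """Pair each character id with the id of its containing word (sketch).

    Words containing missing ('-') or masked ('?') characters, or absent
    from the 100k-word vocabulary, map to the unknown-word id, mirroring
    Pythia's concatenated character/word embedding inputs.
    """
    ids = []
    for word in text.split(" "):
        wid = word_vocab.get(word, word_vocab[unk])
        if "-" in word or "?" in word:
            wid = word_vocab[unk]
        for ch in word + " ":
            # Unknown characters default to id 0 in this sketch.
            ids.append((char_vocab.get(ch, 0), wid))
    return ids[:-1]  # drop the trailing space appended after the last word
```

In the full model, each (character id, word id) pair would index two separate embedding tables whose vectors are concatenated per time-step before entering the bidirectional LSTM encoder.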
Obtaining suggestions. To better aid the epigraphist's task, PYTHIA returns multiple predictions, as well as the level of confidence for each result, rather than a single prediction per text restoration. Specifically, we provide a set of the Top-20 predictions decoded using beam search.
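Beam-search decoding of ranked hypotheses can be sketched as follows. The `score_fn` interface is our illustrative stand-in for the decoder's per-character log-probabilities; the real model scores characters with its softmax output.

```python
def beam_search(score_fn, length, alphabet, beam_width=20):
    """Return the beam_width highest log-probability sequences of a given
    length. score_fn(prefix, ch) returns a log-probability for appending
    ch to prefix (illustrative interface, not Pythia's actual decoder)."""
    beams = [("", 0.0)]
    for _ in range(length):
        candidates = []
        for prefix, logp in beams:
            for ch in alphabet:
                candidates.append((prefix + ch, logp + score_fn(prefix, ch)))
        # Keep only the top-scoring hypotheses for the next step.
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_width]
    return beams
```

Each returned pair carries its cumulative log-probability, which can be surfaced to the epigraphist as a confidence level alongside the suggested restoration.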

Experimental evaluation
The ground-truths for incomplete epigraphic texts were lost over the millennia. Consequently, in order to generate a ground-truth sequence, we artificially removed part of the input text and treated it as the ground-truth sequence. On each training step we selected an inscription, sampled a start index and a length value in [100, 1000], and extracted the corresponding context text, which was then used as input. Within this context, we sampled a new start index and a length in [1, 10] to select the target sequence; its characters' positions were replaced with the special symbol '?', which denotes the positions to be predicted. The test and validation sets used the maximum context length. Beam search with a beam width of 100 was used to decode the hypotheses. To simplify comparisons, all AG accentuation was discarded, as inputting accents was time-consuming for the human evaluations described in the following paragraph. This decision did not noticeably influence the reported scores.
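The sampling procedure above can be sketched as a small generator of (masked input, target) pairs; the function name and interface are ours, for illustration only.

```python
import random

def make_training_example(text, rng, ctx_range=(100, 1000), tgt_range=(1, 10)):
    """Sample a context window from an inscription and hide a short target
    span inside it, mirroring the training-example generation described above."""
    # Sample a context length in [100, 1000], capped by the inscription length.
    ctx_len = min(rng.randint(*ctx_range), len(text))
    start = rng.randint(0, len(text) - ctx_len)
    context = text[start:start + ctx_len]
    # Sample a target span of length in [1, 10] inside the context.
    tgt_len = rng.randint(tgt_range[0], min(tgt_range[1], ctx_len))
    tgt_start = rng.randint(0, ctx_len - tgt_len)
    target = context[tgt_start:tgt_start + tgt_len]
    # Replace the target positions with '?', the prediction marker.
    masked = context[:tgt_start] + "?" * tgt_len + context[tgt_start + tgt_len:]
    return masked, target
```

The masked string plays the role of the damaged input, and the hidden span is the ground-truth the model is trained to reconstruct.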

Methods evaluated
ANCIENT HISTORIAN. Because text restoration is an extremely time-consuming task even for an expert epigraphist, we set out to evaluate the difficulty of the restoration task at hand, and thereby judge the impact of our work, with the help of two doctoral students with epigraphical expertise. The scholars were allowed to use the training set to search for "parallels", and made an average of 50 restorations in 2 hours, with a 57.3% character error rate (CER).

LM PHILOLOGY. To evaluate the performance of a model using "parallels", we trained a LM. Since large parts of the text are garbled, making complete words unidentifiable, and because BERT was not an option, the LM works at a character-level.

PYTHIA-BI-WORD. This is our proposed model of choice, which uses a bidirectional LSTM and both characters and words as inputs.

Results
The aforementioned methods were evaluated using: a) the character error rate (CER) between the top prediction and the target sequence, and b) the Top-20 accuracy score, where we ascertain whether the ground-truth sequence exists within the first 20 predictions. The latter evaluates the effectiveness of PYTHIA as an assistive tool providing restoration suggestions to epigraphists. As shown in Table 2, the ancient historians' restorations had a CER of 57.3%, which is telling of the difficulty of the task. The language model trained on epigraphic datasets performed comparably, with a CER of 57.3%. Interestingly, the two attempts to use larger philological datasets performed worse, very likely due to a divergence between epigraphical and literary cultures. The CERs of the unidirectional PYTHIA-UNI and the bidirectional PYTHIA-BI alternatives were 42.2% and 32.5% respectively. The top score was achieved by the bidirectional PYTHIA-BI-WORD, which took both word and character embeddings as inputs, with a CER of 30.1%. Furthermore, the ground-truth appeared among the 20 most probable predictions of PYTHIA-BI-WORD 73.5% of the time, which indicates that it could be a uniquely effective assistive tool.
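The CER metric is the Levenshtein (edit) distance between the top prediction and the target sequence, normalised by the target length; Top-20 accuracy simply checks for the ground-truth among the first 20 hypotheses. A minimal sketch (function names ours):

```python
def cer(prediction, target):
    """Character error rate: edit distance normalised by target length."""
    m, n = len(prediction), len(target)
    d = list(range(n + 1))  # rolling row of the edit-distance DP table
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,                # deletion
                       d[j - 1] + 1,            # insertion
                       prev + (prediction[i - 1] != target[j - 1]))  # substitution
            prev = cur
    return d[n] / max(n, 1)

def top_k_accuracy(hypotheses, target, k=20):
    """Whether the ground-truth sequence appears among the first k hypotheses."""
    return target in hypotheses[:k]
```

Averaging `cer` over the test set and the fraction of examples for which `top_k_accuracy` holds yield the two figures reported above.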

The importance of context
The presence of context information is a determining factor in the accuracy of epigraphic restorations. We therefore evaluated the impact of different context lengths on the Top-20 accuracy of PYTHIA. As can be seen in Figure 3, the correlation between the context length and the predictive performance of our model is positive, with performance peaking at around 500 characters of context. Furthermore, Figure 3 exemplifies the increased difficulty faced by the model when only a short context (e.g. 20 characters) is available. The latter scenario recalls the difficulties encountered by string-matching and "parallel" search approaches, where search queries are often short.

Visualising PYTHIA's attention
We set up an example modifying lines b.8-c.5 of the inscription MDAI(A) 32 (1907) 428, 275 (PHI ID PH316753), to evaluate PYTHIA's receptiveness to context information and visualise the attention weights at each decoding step. In the text of Figure 4, the last word is a Greek personal name ending in -ου. We set ἀπολλοδώρου ("Apollodorou") as the personal name, and hid its first 9 characters. This name was specifically chosen because it already appears within the input text. Figure 4 illustrates the attention weights for decoding the first 4 missing characters. To aid visualisation, the weights were separately scaled between 0 and 1 within the area of the characters to be predicted ('?') in green, and of the rest of the text in blue; the magnitude is represented by the colour intensity. As can be seen, PYTHIA is attending to the contextually-relevant parts of the text: specifically, ἀπολλοδώρου. The name is correctly predicted. As a litmus test, we substituted ἀπολλοδώρου in the input text with another personal name of the same length: ἀρτεμιδώρου ("Artemidorou"). The predicted sequence alters accordingly to ἀρτεμιδώρ, thereby illustrating the importance of context in the prediction process.
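The region-wise scaling used for the visualisation amounts to a min-max normalisation applied separately to the '?' positions and to the rest of the text. A sketch, with `scale_regions` as our illustrative name:

```python
def scale_regions(weights, mask):
    """Min-max scale attention weights separately inside and outside the
    predicted region (mask is True at '?' positions), as in Figure 4."""
    def minmax(vals):
        if not vals:
            return []
        lo, hi = min(vals), max(vals)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in vals]

    inside = minmax([w for w, m in zip(weights, mask) if m])
    outside = minmax([w for w, m in zip(weights, mask) if not m])
    it, ot = iter(inside), iter(outside)
    # Re-interleave the two scaled regions in the original character order.
    return [next(it) if m else next(ot) for m in mask]
```

Scaling each region independently ensures that the colour intensity within the predicted span is comparable across decoding steps, rather than dominated by the (usually larger) weights over the surrounding context.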

Restoring full texts
We then applied PYTHIA iteratively in order to predict all the missing text of an AG inscription, comparing PYTHIA's predictions with an edition of reference (Rhodes and Osborne, 2003). In Figure 5 the correct restorations are highlighted in blue and the erroneous ones in purple.
In a real-world scenario, PYTHIA would provide more than one hypothesis to the epigraphist. The ground-truth sequence did in fact exist within the Top-20 hypotheses in nearly all cases, illustrating the efficacy of such technologies when paired with human decision-making.
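The iterative restoration loop can be sketched as follows: each run of missing characters is marked for prediction and filled with the model's top hypothesis before moving to the next gap. Here `predict` is a stand-in for the trained model, and the function name is ours.

```python
def restore_full_text(text, predict):
    """Iteratively fill each run of missing characters ('-') with the model's
    top hypothesis. predict(masked) stands in for Pythia: it receives the text
    with one span marked '?' and returns that many predicted characters."""
    while "-" in text:
        # Locate the next contiguous run of missing characters.
        start = text.index("-")
        end = start
        while end < len(text) and text[end] == "-":
            end += 1
        # Mark the run as the span to be predicted, then splice in the prediction.
        masked = text[:start] + "?" * (end - start) + text[end:]
        text = text[:start] + predict(masked) + text[end:]
    return text
```

In practice, each step would surface the Top-20 hypotheses rather than a single one, leaving the final choice to the epigraphist.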

Conclusions
PYTHIA is the first ancient text restoration model of its kind. Our experimental evaluation and ablation studies illustrate the validity of our design decisions, and illuminate the ways PYTHIA can assist, guide and advance the ancient historian's task, and digital humanities proper. The combination of machine learning and epigraphy has the potential to impact meaningfully the study of inscribed textual cultures, both ancient and modern. By open-sourcing PYTHIA and PHI-ML's processing pipeline, we hope to aid future research and inspire further interdisciplinary work.