Minoan linguistic resources: The Linear A Digital Corpus

This paper describes the Linear A/Minoan digital corpus and the approaches we applied to develop it. We aim to set up a suitable study resource for Linear A and Minoan. First, we introduce Linear A and Minoan to make clear why a digital, marked-up corpus of the existing Linear A transcriptions is needed. Second, we list and describe some of the existing resources on Linear A: Linear A documents (seals, statuettes, vessels, etc.), the traditional encoding systems (standard code numbers referring to distinct symbols), a Linear A font, and the newest Unicode Standard character set for Linear A (released on June 16th 2014). Third, we explain our choices concerning the data format: why we decided to digitize the Linear A resources, to convert all the transcriptions into standard Unicode characters, to use an XML format, and to implement the TEI-EpiDoc DTD. Fourth, we describe the development process (from data collection to the issues we faced and the strategies we used to solve them) and a new font we developed (synchronized with the Unicode character set) to make the data readable even on systems that are not up to date. Finally, we discuss the corpus from a Cultural Heritage preservation perspective and suggest some future work.


Introduction to Linear A and Minoan
Linear A is the script used by the Minoan Civilization (Cotterell, 1980) from 2500 to 1450 BC.

Table 1
Writing system          Time span
Cretan Hieroglyphic     2100-1700 BC
Linear A                2500-1450 BC
Linear B                1450-1200 BC

The Minoan Civilization arose on the island of Crete in the Aegean Sea during the Bronze Age. Minoan ruins and artifacts have been found mainly in Crete but also in other Greek islands and in mainland Greece, in Bulgaria, in Turkey and in Israel.
Linear A is not used anymore and, even after decades of studies (it was discovered by Sir Arthur Evans around 1900 (Evans, 1909)), it still remains undeciphered.
All the assumptions and hypotheses made about Linear A and Minoan (its underlying language) are mainly based on comparison with the well-known Linear B, the descendant script derived from Linear A. In fact, Linear B was fully deciphered during the 1950s by Michael Ventris 1 and was found to encode an ancient Greek dialect used by the Mycenaean civilization.
Archaeologist Arthur Evans named the script 'Linear' because it consisted just of lines inscribed in clay (Robinson, 2009), whereas Cretan hieroglyphs, in use during the same period (as shown in Table 1), were more pictographic and three-dimensional.
Even if many symbols are shared by both Linear A and Linear B, it has not been possible to find intelligible words within inscriptions in Linear A by applying Linear B segmentation and phonemes.
Linear A consists of hundreds of symbols probably having syllabic, ideographic, and semantic values. Many of the Linear A symbols that are
There is also an interesting attempt (Younger, 2000b) to decipher single words, specifically toponyms, by applying Linear B phonetic values to the symbols shared by both Linear A and Linear B and following the assumption that toponyms are much more likely to survive as loans in Mycenaean Greek (written in Linear B); we show an example of this approach in Table 2. In the next sections we describe the available existing resources concerning Linear A and the Linear A Digital Corpus: why and how we developed it.

Linear A available resources
Even though Linear A and Linear B were discovered more than a century ago, Linear A has not yet been deciphered. Nevertheless, many scholars have worked on collecting and organizing all the available data in order to study and decipher the script and the language.
Probably because only historical linguists, philologists and archaeologists have attempted to collect and organize the existing data, a rich and well-organized digital corpus is still not available.
In this section we describe all the available Linear A resources, including both physical documents and digital data.

Table 3: Indexed types of support (Younger, 2000e).

Linear A documents
Linear A was written on a variety of media, such as stone offering tables, gold and silver hair pins, and pots (inked and inscribed). The clay documents consist of tablets, roundels, and sealings (one-hole, two-hole, and flat-based).
Roundels are related to a "conveyance of a commodity, either within the central administration or between the central administration and an external party" (Palmer, 1995; Schoep, 2002). The roundel is the record of this transaction that stays within the central administration as the commodity moves out of the transacting bureau (Hallager, 1996). Two-hole sealings probably dangled from commodities brought into the center; one-hole sealings apparently dangled from papyrus/parchment documents; flat-based sealings (themselves never inscribed) were pressed against the twine that secured papyrus/parchment documents (Younger, 2000g; Schoep, 2002), as shown by photographs (Müller, 1999; Müller, 2002) of the imprints that survive on the underside of flat-based sealings.
There are 1,427 Linear A documents containing 7,362-7,396 signs, far less data than we have for Linear B (more than 4,600 documents containing 57,398 signs) (Younger, 2000f).

Godart and Olivier's Collection of Linear A Inscriptions
There is a complete and organized collection of Linear A documents in a paper corpus, known as GORILA: Louis Godart and Jean-Pierre Olivier, Recueil des inscriptions en Linéaire A (Godart and Olivier, 1976). Godart and Olivier indexed the documents by original location and type of support, following the Raison-Pope Index (Raison and Pope, 1971).
For example, the document AP Za 1 is from AP = Apodoulou and the support type is Za = stone vessels as shown in Table 3. Younger (2000h) provides a map with all the Cretan sites and one with all the Greek non-Cretan sites (Younger, 2000i).
Godart and Olivier also provide referential data about conservation places (mainly museums), and periodization (for example: EM II = Second Early Minoan).
Since 1976, this has been the main source of data and point of reference about Linear A documents and it has set up the basis for further studies. Even recent corpora, such as the Corpus transnuméré du linéaire A (Raison and Pope, 1994), always refer to GORILA precise volumes and pages describing each document.

John G. Younger's website
Younger (2000j) has published a website that is the best digital resource available (there is another interesting project, never completed, on Yannis Deliyannis's website 2 ). It collects most of the existing inscriptions (taking GORILA as the main source of data and point of reference), transcribed as Linear B phonetic values (like the KU-NI-SU transcription above).
The transcriptions are kept up to date, and a complete restructuring has been announced for June 2015 (Younger, 2000j).

GORILA symbols catalogue
Many transcription systems have been defined.
The first one was proposed by Raison and Pope (1971) and uses a string composed of one or two characters (Lm, L or Lc depending on whether the symbol is metric, phonetic or compound, respectively) followed by a number, for example: L2.
This system has been widely used by many scholars such as David Woodley Packard (president of the Packard Humanities Institute 3 ), Colin Renfrew and Richard Janko (Packard, 1974;Renfrew, 1977;Janko, 1982).
The second one, used in the GORILA collection (Godart and Olivier, 1976) and on John G. Younger's website, consists of a string composed of one or two characters (AB if the symbol is shared by Linear A and Linear B, A if the symbol is only used in Linear A) followed by a number and possibly other alphabetical characters (due to addenda and corrigenda to earlier versions), for example: AB03.
Many scholars transcribe the symbols shared by Linear A and B with the assumed phonetic/syllabic transcription. This syllabic transcription is based on the corresponding Linear B phonetic values. Younger (2000a) provides a conversion table between Pope and Raison's transcription system, GORILA's transcription system and his own phonetic/syllabic transcription system.
Developing our corpus, we worked mainly on Younger's syllabic and GORILA transcriptions, because the Unicode Linear A encoding is broadly based on the GORILA catalogue, which is also the basic set of characters used in decipherment efforts 4 . We provide an example of different transcriptions for the same symbol in Table 4. As can be noticed, the Unicode encoding is based on the GORILA transcription system.

Linear A Font
The best Linear A Font available is LA.ttf, released by D.W. Borgdorff 5 in 2004.
In this font some arbitrary Unicode positions for Latin characters are mapped to Linear A symbols.
On the one hand, this allows the user to type Linear A symbols directly by pressing the keys on the keyboard; on the other hand, only transliterations can be produced, because the text stored internally is a series of Latin characters.
It should be remarked that this font cannot be used to display a Linear A corpus that is not transliterated and is encoded in Unicode.

Unicode Linear A Characters Set
On June 16th 2014, Version 7.0 of the Unicode Standard was released 6 , adding 2,834 new characters and including, finally, the Linear A character set.
The Linear A block occupies the range 10600-1077F, and its order mainly follows that of GORILA 7 , as seen in Table 4.
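As a minimal illustration of what the block range means in practice, the following Python sketch tests whether characters fall inside 10600-1077F. The function names are our own and purely illustrative; only the range itself comes from the Unicode Standard.

```python
# Boundaries of the Linear A block as stated above (hexadecimal code points).
LINEAR_A_START = 0x10600
LINEAR_A_END = 0x1077F


def is_linear_a(ch: str) -> bool:
    """Return True if the single character `ch` lies in the Linear A block."""
    return LINEAR_A_START <= ord(ch) <= LINEAR_A_END


def count_linear_a(text: str) -> int:
    """Count the characters of `text` that belong to the Linear A block."""
    return sum(1 for ch in text if is_linear_a(ch))


if __name__ == "__main__":
    sample = "A" + chr(0x10600) + chr(0x1077F)
    print(count_linear_a(sample), "of", len(sample), "characters are in the Linear A block")
```

A check of this kind can be used, for instance, to verify that a transcription file contains no leftover Latin transliteration.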
This Unicode Set covers simple signs, vase shapes, complex signs, complex signs with vase shapes, fractions and compound fractions.
This is a resource that opens, for the first time, the possibility to develop a Linear A digital corpus not consisting of a transliteration or alternative transcription.

Corpus data format
Many scholars have faced data curation issues and considered various possible solutions.
Among all the possible solutions, we chose to develop the Linear A Digital Corpus as a collection of TEI-EpiDoc XML documents.
In this section we explain why.

Why Digital?
Many epigraphic corpora have begun to be digitized, and there are many reasons to do so. A digital corpus can include several representations of the inscriptions (Mahoney, 2007):
• pictures of the original document;
• pictures of drawings or transcriptions made by hand simplifying the document;
• diplomatic transcriptions;
• edited texts;
• translations;
• commentaries.
Building a database is enough to get much richer features than the ones a paper corpus would provide. The most visible feature of an epigraphic database is its utility as an Index Universalis (Gómez Pantoja and Álvarez, 2011); unlike hand-made indexes, there is no need to constrain the number of available search-keys.
Needless to say, the opportunity to have the data available also on the web is valuable.

Why Unicode?
Text processing must also take into account the writing systems represented in the corpus.
If the corpus consists of inscriptions written in the Latin alphabet, then the writing system of the inscriptions is the same as that of the Western European modern languages used for meta-data, translations, and commentaries.
In our case, unfortunately, we have to deal with Linear A, so we need to find a way to represent our text.
Scholars have objected to epigraphic databases on the grounds of their poor ability to represent non-Latin writing systems graphically (García Barriocanal et al., 2011).
This led some databases to use non-standard fonts, which proved to be a bad move, compromising overall compatibility and system upgrading.
The font-based approach is nonetheless appealing because, if the corpus needs to be printed, fonts will be needed sooner or later in any case.
The font-based solution assumes that all the software involved can recognize font-change markers. Unfortunately, some Database Management Systems (DMSs) do not allow changes of font within a text field, and some export or interchange formats lose font information.
When the scripts of the corpus are all supported, which will be the case for any script still used by a living language, Unicode is a better approach. Despite Minoan not being a living language, Linear A is finally part of the Unicode 7.0 Character Code Charts 8 but some sign groups conventionally interpreted as numbers have no Unicode representation.

Why XML?
Until not so long ago, markup systems always involved special typographical symbols in the text: brackets, underdots, and so on.
Some epigraphers see XML as a natural transformation of what they have always done, with all the additional benefits that come from standardization within the community.
There is a growing consensus that XML is the best way to encode text.
Unfortunately, the special brackets, underdots, and other typographical devices may not be supported by the character set of the computer system to be used.
A key incentive for using XML is the ability to exchange data with other projects.
It is convenient to be able to divide the information in many layers: cataloging, annotating, commenting and editing the inscriptions.
In some cases, it may be necessary to merge different layers from different projects (for example, when each project focuses on a specific layer, for which it provides the best quality); as a consequence, the resulting data should be in compatible forms.
If the projects use the same Document Type Definition (DTD), in the same way, this is relatively easy.
While corpora that store their texts as wordprocessor files with Leiden markup can also share data, they must agree explicitly on the details of text layout, file formats, and character encodings.
With XML, it is possible to define either elements or entities for unsupported characters.
This feature is particularly interesting in our case, because it offers a solution for representing numbers (Linear A numbers, except for fractions, have no Unicode representation). Suppose we want to mark up, in the XML, a sign group conventionally interpreted as the number 5. As specified in the TEI DTD, this could be expressed as <g ref="#n5"/>, where the element g indicates a glyph or non-standard character, and the attribute value points to the glyph element, which contains information about the specific glyph. An example is given in Figure 1.
Alternatively, the project might define an entity to represent this character. Either way, the XML text notes that there is the Linear A number 5, and the later rendering of the text for display or printing can substitute the appropriate character in a known font, a picture of the character, or even a numeral from a different system. Such approaches assume that tools are available for these conversions; some application, transformation, or stylesheet must have a way to know how to interpret the given element or entity.
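The substitution step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the stylesheet used for the corpus: the XML fragment is a simplified, namespace-free stand-in for a real TEI file, the content of the glyph element (desc and mapping) is our own choice, and the function name render is hypothetical. Only the <g ref="#n5"/> reference and the charDecl/encodingDesc nesting come from the text.

```python
import xml.etree.ElementTree as ET

# The xml:id attribute is expanded by ElementTree to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, hypothetical fragment: a glyph declared in the header and
# referenced from the text through <g ref="#n5"/>.
SAMPLE = """<TEI>
  <teiHeader>
    <encodingDesc>
      <charDecl>
        <glyph xml:id="n5">
          <desc>sign group conventionally interpreted as the number 5</desc>
          <mapping type="standard">5</mapping>
        </glyph>
      </charDecl>
    </encodingDesc>
  </teiHeader>
  <text><body><div type="edition"><ab>AB03 <g ref="#n5"/></ab></div></body></text>
</TEI>"""


def render(elem, glyphs):
    """Flatten `elem` to plain text, replacing each <g> by its declared mapping."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "g":
            glyph_id = child.get("ref", "").lstrip("#")
            parts.append(glyphs.get(glyph_id, "?"))  # "?" marks an undeclared glyph
        else:
            parts.append(render(child, glyphs))
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(SAMPLE)
# Map each declared glyph id to the replacement chosen for display.
glyphs = {g.get(XML_ID): (g.findtext("mapping") or "?") for g in root.iter("glyph")}

if __name__ == "__main__":
    print(render(root.find(".//ab"), glyphs))
```

Here the replacement is a numeral from a different system; a renderer could equally substitute a character from a known font or a reference to a picture, as long as it knows how to interpret the declaration.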
Using XML provides two advantages: first, it makes it possible to encode the characters that occur in the text (as shown above); second, it is well suited to encoding meta-information.

Why EpiDoc?
If a project decides to use XML, the most appropriate DTD (or schema) needs to be chosen. As in every other humanities discipline, the basic question is whether to use a general DTD, like the TEI, or to write a project-specific one. Some projects need DTDs that are extremely specific to the types of inscriptions they are dealing with, whereas other projects prefer to rely on existing, widely used DTDs. Mahoney (2007) has analyzed the digitization issues in depth, taking into account the advantages and disadvantages of different approaches; her conclusion is that it is best to use EpiDoc 9 , an XML encoding tool that can also be used to write structured documents compliant with the TEI standard 10 .
The EpiDoc DTD is the TEI, with a few epigraphically oriented customizations made using the standard TEI mechanisms. Rather than writing a DTD for epigraphy from scratch, the EpiDoc group uses the TEI because the TEI has already addressed many of the taxonomic and semantic challenges faced by epigraphers, because the TEI-using community can provide a wide range of best-practice examples and guiding expertise, and because existing tooling built around the TEI could easily lead to early and effective presentation and use of TEI-encoded epigraphic texts (Mahoney, 2007).
The TEI and EpiDoc approaches have already been adopted by several epigraphic projects (Bodard, 2009), such as the Dêmos project (Furman University) and the corpus of Macedonian and Thracian inscriptions being compiled at KERA, the Research Center for Greek and Roman Antiquity at Athens (Mahoney, 2007).
Other scholars also regard EpiDoc as a suitable choice. Felle (2011) compares the EAGLE (Electronic Archive of Greek and Latin Epigraphy 11 ) project with the existing EpiDoc resources, viewing these resources as different but complementary. Álvarez et al. (2010) and Gómez Pantoja and Álvarez (2011) discuss the possibility of sharing epigraphic information as EpiDoc-based Linked Data and describe how they implemented a relational-to-linked-data solution for the Hispania Epigraphica database. Cayless (2003) evaluates EpiDoc as a relevant digital tool for epigraphy, allowing for a uniform representation of epigraphic metadata.
The EpiDoc guidelines are emerging as one standard for digital epigraphy with the TEI.
EpiDoc is not the only possible way to use the TEI for epigraphic texts, but the tools, documentation, and examples 12 make it a good environment for new digitization projects such as ours.

EpiDoc structure
An EpiDoc document is structured as a standard TEI document, with the teiHeader element including some initial Desc sections (fileDesc, encodingDesc, profileDesc, revisionDesc, etc.) containing metadata, general information and descriptions (here we annotated place, period, kind of support and specific object/fragment IDs). An interesting use of encodingDesc is shown in Figure 1 above: the glyph element has to be defined inside its parent element charDecl and its grandparent element encodingDesc.
The teiHeader element is followed by the text element, which includes the body element, composed of a series of unnumbered <div>s distinguished by their type attributes (we show an example of the EpiDoc <div> element in Figure 2).
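To make the overall layout concrete, the following Python sketch builds a skeleton with a teiHeader followed by text, body and a typed <div>. It is an unvalidated illustration rather than the corpus's actual template: the function name build_epidoc is hypothetical, the placement of the metadata inside msDesc is our assumption, namespaces and several elements a valid TEI file requires (such as publicationStmt) are omitted, and the period and line contents are placeholders.

```python
import xml.etree.ElementTree as ET


def build_epidoc(doc_id, place, period, support, lines):
    """Return a TEI-like element tree for one inscription (unvalidated sketch)."""
    tei = ET.Element("TEI")

    # teiHeader: metadata about the inscription (place, period, support, ID).
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = doc_id
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ms_desc = ET.SubElement(source_desc, "msDesc")
    ms_identifier = ET.SubElement(ms_desc, "msIdentifier")
    ET.SubElement(ms_identifier, "idno").text = doc_id
    phys_desc = ET.SubElement(ms_desc, "physDesc")
    object_desc = ET.SubElement(phys_desc, "objectDesc")
    support_desc = ET.SubElement(object_desc, "supportDesc")
    ET.SubElement(support_desc, "support").text = support
    origin = ET.SubElement(ET.SubElement(ms_desc, "history"), "origin")
    ET.SubElement(origin, "origPlace").text = place
    ET.SubElement(origin, "origDate").text = period

    # text/body: unnumbered <div>s distinguished by their type attribute.
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    edition = ET.SubElement(body, "div", {"type": "edition"})
    ab = ET.SubElement(edition, "ab")
    for number, line in enumerate(lines, start=1):
        lb = ET.SubElement(ab, "lb", {"n": str(number)})
        lb.tail = line  # the text of a line follows its <lb/> marker
    return tei


if __name__ == "__main__":
    # Placeholder values; only the ID, place and support type come from the text.
    doc = build_epidoc("AP Za 1", "Apodoulou", "(period)", "stone vessel",
                       ["(text of line 1)", "(text of line 2)"])
    print(ET.tostring(doc, encoding="unicode"))
```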
One advantage of structured markup is that editors can encode more information about how certain a particular feature is. The date of an inscription, for example, can be encoded as a range of possible dates. EpiDoc includes the TEI <certainty> element and the cert attribute to encourage editors to say whether or not they are completely confident of a given reading. After some discussion, the EpiDoc community (Mahoney, 2007) decided that certainty should be expressed as a yes-or-no value: either the editor is certain of the reading, or not. Gradual certainty is too complicated to manage and is best explained in the commentary.

Developing the Linear A Corpus
The hope that computational approaches could help decipher Linear A, along with the evident lack of rich digital resources in this field, led us to develop this new resource. In this section we describe the issues we faced and the strategies we used to solve them.

Data Collection
Fortunately, the existence of Younger's website and the GORILA volumes, together with the Raison-Pope Index, made a semi-automatic collection process possible: starting from syllabic transcriptions taken from Younger's website (with his permission), we converted them into Unicode strings through Python scripts and acquired all the metadata provided in Younger's transcriptions (location and support IDs, conservation place, periodization, etc.).
Younger's resources on his website consist of two HTML pages, one containing inscriptions from Haghia Triada (that is the richest location in terms of documents found there) (Younger, 2000k) and the other containing documents from all the other locations (Younger, 2000l).
Younger's transcriptions are well enriched with metadata. The metadata convey the same information found in GORILA, including the Raison-Pope Index, plus some additional description of the support (this was not necessary in GORILA volumes, where the transcriptions are shown just next to the documents pictures) and the reference to the specific GORILA volume and pages.

Segmentation Issues
When working on ancient writing systems, segmentation issues are expected to come up. John G. Younger explains (Younger, 2000c) that in Linear A separation is mainly indicated in two ways: first, by associating sign groups with numbers or logograms, thereby implying a separation; second, by placing a dot between two sign groups, thereby explicitly separating the sign groups or between a sign group and some other sign like a transaction sign or a logogram. Younger also explains that in texts that employ a string of sign groups, dots are used to separate them and this practice is most notable on non-bureaucratic texts and especially in religious texts.
On his website, Younger also covers the hyphenation issue (Younger, 2000d), explaining that in some cases a sign group is split across lines, and that the reason may involve separating prefixes from base words (the root of a sign group) or base words from their suffixes. As Younger points out, this hypothesis would require evidence showing that affixes are involved. The hyphenation issue is more complex to solve, because a 'neutral' resource should avoid transcriptions that imply a well-known segmentation of Linear A sign groups. In Younger's transcriptions, split sign groups are reunified in order to make it clearer when a known sign group is present. Our digital collection, instead, keeps the text as it appears on the document; all the information about interpretations of this kind can be stored separately.

Obtaining Unicode transcriptions
We managed to obtain Unicode encoded transcriptions by automatically converting Younger's phonetic transcriptions to GORILA transcriptions (manually checked against GORILA volumes) and then by automatically converting GORILA transcriptions to Unicode codes and printing them as Unicode characters (UTF-8 encoding). In order to create the syllables-to-GORILA and the GORILA-to-Unicode dictionaries, we took into account Younger's conversion table mentioned in Subsection 2.4 and the official Unicode documentation (containing explicit Unicode-to-GORILA mapping information). All these processing steps have been implemented through Python scripts.
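The two-step conversion can be sketched as follows. This is not the authors' script, only a minimal illustration under stated assumptions: the three-entry dictionary is a fragment based on the Linear B values of the shared signs AB01-AB03, the function names are hypothetical, and the second step assumes that the official Unicode character names have the form "LINEAR A SIGN AB003" (the GORILA number padded to three digits), which Python's unicodedata module resolves when its Unicode database is version 7.0 or later.

```python
import re
import unicodedata

# Illustrative fragment of a syllables-to-GORILA dictionary (assumption:
# Linear B phonetic values of the shared signs AB01, AB02 and AB03).
SYLLABLE_TO_GORILA = {"DA": "AB01", "RO": "AB02", "PA": "AB03"}


def gorila_to_unicode(label: str) -> str:
    """Map a GORILA label such as 'AB03' to the corresponding Unicode character."""
    match = re.fullmatch(r"(AB|A)(\d+)([A-Z]*)", label)
    if match is None:
        raise ValueError(f"unrecognized GORILA label: {label!r}")
    prefix, number, suffix = match.groups()
    # Unicode names pad the sign number to three digits, e.g. AB03 -> AB003.
    name = f"LINEAR A SIGN {prefix}{int(number):03d}{suffix}"
    return unicodedata.lookup(name)  # raises KeyError if the name is unknown


def transcribe(sign_group: str) -> str:
    """Convert a hyphen-separated syllabic transcription to Linear A characters."""
    labels = [SYLLABLE_TO_GORILA[syllable] for syllable in sign_group.split("-")]
    return "".join(gorila_to_unicode(label) for label in labels)


if __name__ == "__main__":
    # "PA-DA-RO" is an arbitrary sequence chosen only to exercise the code.
    result = transcribe("PA-DA-RO")
    print([f"U+{ord(ch):05X}" for ch in result])
    print(result.encode("utf-8"))  # UTF-8 bytes, as written to the corpus files
```

Looking signs up by name rather than hard-coding code points keeps the mapping checkable against the official Unicode documentation; compound signs and other labels that do not fit this simple pattern would need an explicit table.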

XML annotation
Once the whole corpus had been collected and encoded in Unicode, we automatically added part of the XML annotation through a Python script. These documents were later manually corrected and completed, checking against the GORILA volumes.

A new Linear A font
Before the Unicode 7.0 release, there was no way to display Unicode characters in the range 10600-1077F. Even now, systems that are not up to date may have trouble displaying those characters. Some implementations of Unicode support in certain contexts (for example, for LaTeX output) are not always up to date, so fonts for the most recent character sets cannot be assumed to be available. We decided to develop a new Linear A font, solving the main issue found in LA.ttf (wrong Unicode positions). Starting from the official Unicode documentation, we created a set of symbols graphically similar to the official ones and aligned them to the right Unicode positions. We decided to name the font John_Younger.ttf to show our appreciation for Younger's work. He made the results of GORILA available to a wider public on digital media; this is the same goal we want to pursue by developing and distributing this font. We released the font file at the following URL: http://openfontlibrary.org/en/font/john-younger.

The Linear A Digital Corpus as cultural resource
As stated by the European Commission (2015) and UNESCO (2003), the notion of cultural heritage does not apply just to material objects and works of art, but also to 'intangible cultural heritage', such as traditions and creative expressions. From this perspective, linguistic corpora fit this definition perfectly; in fact, they contain information about the tradition, knowledge and lifestyle of a certain culture. Despite the fact that the Minoan language has not yet been deciphered, we know that the Linear A corpus provides interesting information concerning economy, commerce and religion.
As mentioned in Subsection 2.1, Schoep (2002) made a critical assessment of the Linear A tablets and their role in the administrative process, studying the physical supports.
Ruth Palmer (1995) made an in-depth study of commodity distributions (listing precise quantities and places) among Minoan centers, even without a full understanding of the documents' contents. As Palmer points out, 'the ideograms for basic commodities, and the formats of the Linear A texts are similar enough to their Linear B counterparts to allow valid comparison of the types and amounts of commodities which appear in specific contexts'. So, it is possible to have 'an idea of the focus of the economy' and of 'the scale and complexity of the transactions'. From the Linear A tablets, we can infer information about the resource management and administration system of Minoan centers.
Van den Kerkhof and Rem (2007) analyzed the Minoan libation formulas: religious inscriptions on cups, ladles and tables that were used in the offerings of oil and other powerful drinks at dawn. The priestesses who carried out the Minoan libation ritual used all kinds of utensils, and they often inscribed their sacred formulas onto these objects. Around thirty of these texts have survived (whole or in part) on libation tables, ladles and vases, written in various kinds of handwriting. Transcripts of these religious inscriptions are available from Consani et al. (1999) and from John G. Younger (2000m) on his website. As noticed by Duhoux (1989), the Minoan libation formulas have a fixed structure with variable elements. In fact, some studies of Minoan syntax (Davis, 2014) have been made by observing the order of the sign groups found in these regular formulas. More importantly, the presence of olive-like ideograms could tell us that the Minoans used olive oil for libation (Van den Kerkhof and Rem, 2007). Beyond these already accessible parts of the Minoan cultural heritage, a huge part is preserved in the corpus as well: the Minoan language, with its hidden stories reflecting the life of a civilization. We hope that our contribution can be useful to the community and that Minoan, in its digital form, may finally be deciphered through computational approaches.

Future Work
We are working on XSL style sheets in order to create suitable HTML pages. All the data will be freely available and published at the following URL: http://ling.ied.edu.HK/ gregoire/lineara. A further step will be developing a web interface to annotate, and dynamically enrich the corpus information.