Sign Clustering and Topic Extraction in Proto-Elamite

We describe a first attempt at using techniques from computational linguistics to analyze the undeciphered proto-Elamite script. Using hierarchical clustering, n-gram frequencies, and LDA topic models, we both replicate results obtained by manual decipherment and reveal previously-unobserved relationships between signs. This demonstrates the utility of these techniques as an aid to manual decipherment.


Introduction
In the late 19th century, excavations at the ancient city of Susa in southwestern Iran began to uncover clay tablets written in an unknown script later dubbed 'proto-Elamite'. Over 1,500 tablets have since been found at Susa, and a few hundred more at sites across Iran, making it the most widespread writing system of the late 4th and early 3rd millennia BC (circa 3100-2900 BC) and the largest corpus of ancient material in an undeciphered script. Proto-Elamite (PE) is the conventional designation of this script, whose language remains unknown but was presumed by early researchers as likely to be an early form of Elamite. A number of features of the PE writing system are understood. These include tablet format and direction of writing, the numeric systems, and the ideographic associations of some non-numeric signs, predominantly those for livestock accounting, agricultural production, and possibly labor administration. Yet the significance of the majority of PE signs, the nature of those signs (syllabic, logographic, ideographic, or other) and the linguistic context(s) of the texts remain unknown. It was recognized from the outset, due to the features of the script, that all the proto-Elamite tablets were administrative records, rather than historical or literary compositions (Scheil, 1905).
Texts are written in lines from right to left, but are rotated in publication to be read from top to bottom (then left to right) following academic practice for publishing the contemporary proto-cuneiform tablets. The content of a text is divided into entries, logical units which may span more than one physical line. The entry itself is a string of non-numeric signs whose meanings are for the most part undeciphered. Each entry is followed by a numeric notation in one of several different numeric systems, which quantifies something in relation to the preceding entry. This serves to mark the division between entries. An important exception exists in what are currently understood to be 'header' entries: these can present information that appears to pertain to the text as a whole, and are followed directly by the text's first content entry with no intervening numeric notation. A digital image and line drawing of a simple PE text along with transliteration are shown in Figure 1.
Although a complete digital corpus of PE texts exists (Section 2), it has not been studied using the standard toolkit of data exploration techniques from computational linguistics. The goals of this paper are threefold. By applying a variety of computational tools, we hope to
i. promote interest in and awareness of the problems surrounding PE decipherment
ii. demonstrate the effectiveness of computational approaches by reproducing results previously obtained by manual decipherment
iii. highlight novel patterns in the data which may inform future decipherment attempts
We hope to show that interesting data may be extracted from the corpus even in the absence of a complete linguistic decipherment. To encourage further study in this vein, we are also releasing all data and code used in this work as part of an online suite of data exploration tools for PE. Additional figures and interactive visualizations are also available as part of this toolkit.

Conventional Decipherment Efforts
Studies towards the decipherment of PE can be summarised by a relatively short bibliography of serious efforts (Englund, 1996). Stumbling blocks to decipherment have included inaccuracies in published hand-copies of the texts, a lack of access to high-quality original images, and the associated difficulty in drawing up an accurate signlist and producing a consistently rendered full transcription of the corpus. Members of the Cuneiform Digital Library Initiative (CDLI) have been remedying these deficiencies over the past two decades, and the PE script can now boast (i) a working signlist with a consistent manner of transcribing signs in ASCII, and (ii) an open-access, searchable database hosting the entire corpus in transcription, alongside digital images and/or hand-copies of almost every text. Historically, specialists of PE have operated on a working hypothesis that it may be, like later Sumerian cuneiform, to some extent a mixed system of ideographic or logographic signs alongside signs that may represent syllables. However, the level of linguistic content represented in both PE and proto-cuneiform has been called into question (Damerow, 2006), and the presence of a set of syllabic signs in PE is yet to be proven.
The strict linear organisation of signs in PE is the earliest such known to a writing system: proto-cuneiform arranged signs in various ways within cases (and sometimes subcases), and only in cuneiform from several hundred years later did scribes begin to consistently write in lines with one sign following the next. However, it is not clear to what extent the linear sign organization of PE reflects the flow of spoken language as in later writing systems. Analysis of sign and entry ordering in the texts has also revealed some tabular-like organising principles familiar from proto-cuneiform. Longer sequences of signs can often be broken down into constituent parts appearing to follow hierarchical ordering patterns apparently based upon administrative (rather than phonetic/linguistic) principles, and hierarchies can be seen across entries as well (Hawkins, 2015; Dahl et al., 2018).
Traditional linguistic decipherment efforts have not yet succeeded in identifying a linguistic context for PE, though progress has been made, for example in positing sets of syllabo-logographic signs thought to be used to write personal names (PNs). We refer to Meriggi's (1971:173-174) syllabary as shorthand for these signs, as he was the first to identify such a set and his work has since been closely imitated (Desset 2016; Dahl 2019:85). Although he called it a syllabary, Meriggi was aware that the signs might not prove to be syllabic and that object or other signs might remain mixed in.
Continued efforts to establish the organizational principles of the PE script and to isolate possible syllable sequences or PNs may be advanced by computational techniques, which can be used to evaluate hypotheses much faster than purely manual approaches. In this endeavour it is necessary to remember that although early writing encodes meaningful information, that information may or may not be linguistic (Damerow, 2006). Although it is not known why PE disappeared after a relatively short period of use, one of several possibilities is that this relates to the way it represents information, perhaps providing a poorer, less versatile encoding compared to later cuneiform with its mixed syllabo-logography.

Data
All data in this work are based on the PE corpus provided by the CDLI. After removing tablets which only bear unreadable or numeric signs, this dataset comprises 1399 distinct texts. Most of these are very short: the mean text length is 27 readable signs, of which only 10 are non-numeric on average. Long texts do exist, however, up to a maximum length of 724 readable signs of which 198 are non-numeric.
Our working signlist (extracted from the transcribed texts) contains 49 numeric signs and 1623 non-numeric signs. Of these, 287 are 'basic' signs, and 1087 are labeled as variants due to minor graphical differences. Sign variants are denoted by ∼, as in M006∼b, a variant of the basic sign M006. In an ongoing process, analysis of the corpus aims to confirm whether sign variants are semantically distinct, or reflect purely graphical variation. Where the latter case is understood, the sign is given a numeric rather than alphabetic subscript, as in M269∼1. The remaining 249 non-numeric signs are compounds called complex graphemes which are made up of two or more signs in combination, as in |M136+M365|.
Future work is required to establish which sign variants are meaningfully distinct from their base signs; in the absence of such work, we have chosen to treat all variants as distinct until proven otherwise. Our models give interpretable results under this assumption, suggesting this is a reasonable approach. There are, however, cases where collapsing sign variants together would seem to affect our results, and we highlight these where relevant.

Hierarchical Sign Clustering
Manual decipherment of PE has proceeded in part by identifying that some signs occur in largely the same contexts as other signs. This has produced groupings of signs into "owners", "objects", and other functionally related sets (Dahl, 2009). For example, M388 and M124 are known to be parallel "overseer" signs which appear in alternation with one another (Dahl et al., 2018:25).
In the same vein, we have investigated techniques for clustering signs hierarchically based on the way they occur and co-occur within texts. Our work considers three approaches to sign clustering: a neighbor-based clustering groups signs based on the number of times each other sign occurs immediately before or after that sign in the corpus; an HMM clustering groups signs based on the emission probabilities of a 10-state hidden Markov model (HMM) trained on the corpus; and a generalized Brown clustering groups signs as described in Derczynski and Chester (2016). By using three different clustering techniques, we can search for clusters which recur across all three methods to maximize the likelihood of finding those that are meaningful. This reduces the impact of noise in the data, which is especially useful given the small size of the PE corpus.
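The neighbor-based approach can be illustrated with a minimal sketch. It is not the implementation used in this work: the cosine distance, the average linkage, and the function names `neighbor_vectors` and `cluster` are assumptions made for illustration, and entries are assumed to be lists of sign names.

```python
from collections import Counter, defaultdict
from math import sqrt

def neighbor_vectors(entries):
    """Map each sign to counts of the signs seen immediately before/after it."""
    vectors = defaultdict(Counter)
    for entry in entries:
        for left, right in zip(entry, entry[1:]):
            vectors[left][("after", right)] += 1   # `right` follows `left`
            vectors[right][("before", left)] += 1  # `left` precedes `right`
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity of two sparse count vectors (assumed metric)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return 1.0 - dot / norm if norm else 1.0

def cluster(vectors):
    """Average-linkage agglomerative clustering; returns a nested-tuple tree."""
    # Each working cluster is a pair (tree, list of member signs).
    clusters = [(sign, [sign]) for sign in sorted(vectors)]

    def linkage(a, b):
        pairs = [(x, y) for x in a[1] for y in b[1]]
        return sum(cosine_distance(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters with the smallest average pairwise distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        a, b = clusters[i], clusters[j]
        merged = ((a[0], b[0]), a[1] + b[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]
```

Signs that share the same left and right neighbors end up as siblings in the resulting tree. The quadratic search for the closest pair is adequate for a toy example but would need a more efficient implementation for the full signlist.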

Clustering Evaluation
We identified commonalities between our three clusterings using the following heuristic. Given a set of signs S, we found for each clustering the height of the smallest subtree containing every sign in S. If all of these subtrees were short (which we took to mean not larger than 2|S|) then we called S a stable cluster.
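One possible reading of this heuristic is sketched below. It assumes each clustering is available as a tree of nested tuples with sign names at the leaves, and it measures a subtree by its height in edges; both choices, and all function names, are illustrative assumptions rather than the procedure actually used.

```python
def leaves(tree):
    """Set of signs under a tree given as nested tuples with string leaves."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def height(tree):
    """Number of edges on the longest path from the root down to a leaf."""
    if isinstance(tree, str):
        return 0
    return 1 + max(height(child) for child in tree)

def smallest_subtree(tree, signs):
    """Smallest subtree whose leaves include every sign in `signs`, or None."""
    if not signs <= leaves(tree):
        return None
    if not isinstance(tree, str):
        for child in tree:
            found = smallest_subtree(child, signs)
            if found is not None:
                return found  # a child already covers the whole set
    return tree

def is_stable(trees, signs):
    """True if every clustering holds `signs` in a subtree of height <= 2|S|."""
    signs = set(signs)
    limit = 2 * len(signs)
    subtrees = [smallest_subtree(tree, signs) for tree in trees]
    return all(s is not None and height(s) <= limit for s in subtrees)
```

If the intended measure is the number of leaves rather than the height, `height(s)` can be replaced by `len(leaves(s))` without changing the rest of the sketch.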
In many cases, the stable clusters comprise variants of the same sign. This is the case for M157 and M157∼a, which cluster together across all techniques and are already believed to function similarly to each other, if not identically.
One very large stable cluster consists of the signs M057, M066, M096, M218, and M371. This cluster is shown as it appears in each clustering in Figure 2. These signs belong to Meriggi's proposed syllabary (Meriggi 1971, esp. pp. 173-4) and are hypothesized to represent names syllabically (or logographic-syllabically; Desset 2016:83). Desset (2016:83) likewise identified "approximately 200 different signs" from possible anthroponyms, "among which M4, M9, M66, M96, M218 and M371 must be noticed for their high frequency." Desset's list differs from our cluster by only two signs, replacing M057 with M004 and M009. M004 and M009 group with other members of the putative syllabary in each clustering, but their position is much more variable across the three techniques. For M009 at least, this may indicate multivalent use: besides its inclusion in hypothesised PNs (e.g. Meriggi 1971:173; Dahl 2019:85), it appears in various different administrative contexts that do not appear to include PNs (e.g. P008206) and as an account postscript (see below and 5.3).
All three methods group the five signs in our cluster close to other suspected syllabic signs; however, since each technique groups them with a different subset of the syllabary, only these five form a stable group across all three methods. This may be due simply to their frequency, or they could in fact form a distinct subgroup within the proposed syllabary; future work may yield a better understanding of possible anthroponyms by trying to identify other such subgroups.
While this discussion has focused on the stable clusters for which we can provide some interpretation, others represent groups of signs with no previously recognised relationship, such as M003∼b and M263∼a (Figure 3). M003(∼a/b) are "stick" signs understood in some PE contexts to denote worker categories (Dahl et al., 2018); they are graphically comparable to proto-cuneiform PAP∼a-c and PA, the latter of which can, in later Sumerian, indicate ugula, a work group foreman/administrator. M263∼a is one of a series of depictions of "vessels", this particular variant appearing in 27 texts; notably the base sign M263 appears as a possible element in PNs (Dahl, 2019:85). Interestingly, M003∼b and M263∼a only appear together in a single text (P008727), one of a closely related group of short texts that each end in the administrative postscript M009 M003∼b or M009 M003∼c. It can also be noted that M263∼1 occurs in another text belonging to this small group.
It thus remains for future work to interpret this and the many other stable clusters resulting from our work. These additional groupings are detailed in our data exploration toolkit, along with complete dendrograms for each clustering which are too large to include in this publication.
Although we have not performed a full study of the clusterings produced when sign variants are collapsed together, a preliminary comparison suggests this is worth pursuing. For instance, a new cluster of small livestock signs arises in the neighbor-based clustering, comprising M367 ("billy-goat"), M346 ("sheep"), M006 ("ram"), and M309 (possible animal byproduct). Existing clusters, such as the stable cluster of syllabic signs, appear to remain intact, but a complete comparison of the techniques in this setting is warranted.

Sign Frequency and n-Gram Counts
Sign frequency is another useful datapoint for understanding the overall content of the corpus and for building a more nuanced understanding of sign use (Dahl, 2002; Kelley, 2018). Figure 4 shows the most common PE unigrams, bigrams, and trigrams. These counts exclude n-grams containing numeric signs or broken or unreadable signs (transcribed as X or [...]); n-grams which span the boundary between entries are also excluded. Note the sharp drop-off in frequency from the most frequent signs to the rest of the signary; in fact nearly half the attested signs (745 out of 1623) occur only once. Similar results were presented in Dahl 2002.
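The counting rules just described can be sketched as follows. This is an illustrative reconstruction, not the code released with this work: it assumes each entry is a list of sign names, that numeric signs are written in the form `1(N01)`, and that damaged signs appear as `X` or `[...]`; `is_countable` and `ngram_counts` are hypothetical names.

```python
from collections import Counter

def is_countable(sign):
    """False for damaged signs and for numeric signs (assumed to contain '(N')."""
    return not (sign in {"X", "[...]"} or "(N" in sign)

def ngram_counts(entries, n):
    """Count n-grams of readable non-numeric signs within each entry."""
    counts = Counter()
    for entry in entries:
        # Windows are taken inside one entry, so none spans an entry boundary.
        for i in range(len(entry) - n + 1):
            gram = tuple(entry[i:i + n])
            if all(is_countable(sign) for sign in gram):
                counts[gram] += 1
    return counts

def hapax_count(counts):
    """Number of n-grams that occur exactly once."""
    return sum(1 for c in counts.values() if c == 1)
```

Calling `ngram_counts(entries, 1)` gives the sign frequencies, and `hapax_count` applied to that result gives the number of signs attested only once.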
The n-gram counts reveal the scale at which complex sequences of information are repeated across tablets. Over 1600 strings contain at least 3 non-numeric signs. Of these, only 11 trigrams are repeated at least 5 times across the corpus; two of these end in the "grain container" sign M288 and are therefore best parsed as undeciphered bigrams followed by an object sign. Following this, 52 other trigrams are repeated three or four times across the corpus, leaving the great majority (98%) of trigrams to appear only once or twice. The most frequent trigram, M377∼e M347 M371 (found 17 times per Figure 4), appears in no more than about 1.5% of the texts. Even among bigrams, the most common can only occur in up to 3.2% of texts.
External comparisons may help determine whether this is a meaningful degree of repetition, but such comparisons are not straightforward. Third-millennium Sumerian or Akkadian accounting tablets are reasonable corpora to compare against, but these are available only in transliteration (using sign readings) while PE is transcribed (using sign names). This distinction makes n-gram counts from the two corpora incomparable without further work to transform the data.
Despite this, an impressionistic assessment of Ur III Sumerian administrative texts suggests that they are highly repetitious: information of wide importance to the administration (e.g. basic nouns, phrases describing administrative functions, month names, ruler names, etc.) occurs frequently. If one expects a similar pattern in the PE administrative record, our initial analysis suggests that trigrams (and perhaps bigrams) may not be a significant tactic for encoding these types of information, although unigrams might.
An n-gram analysis can also be used to be[...]
Repeated n-grams, anthroponymic or otherwise, become increasingly rare for n > 3. No 4-gram or 5-gram appears more than 3 times; no 6-gram appears more than twice; and no 7-gram appears more than once. This low level of repetition indicates that common frequency-based linguistic decipherment methods may be ineffective on this corpus. We can, however, identify repeated strings which are similar to one another, if not exact copies, which may lead to insights about the function of certain PE signs and sign sequences. For example, the only two 6-grams which occur multiple times in the corpus differ from one another by only a single sign:
M305 M388 M240 M097∼h M004 M218
M305 M388 M146 M097∼h M004 M218
A further variant appears once in the corpus:
M305 M388 M347 M097∼h M004 M218
Traditional graphotactical analysis parses the first of these strings as follows:
• Institution, household, or person class: M305
• Person class: M388
• Further designations of the individual: M240 M097∼h M004 M218
Side-by-side comparison of these 6-grams raises the question of whether the third sign in each sequence (M240, M146, and M347 respectively) is yet another classifier preceding a stable PN M097∼h M004 M218, or may reflect a PN pattern in which the first element (perhaps a logogram?) can alternate.
Although there are no repeated 7-grams or 8-grams, there are three pairs of 7-grams which differ by only a single sign, and one such pair of 8-grams. We hope that by exploring sign usage within such strings, future work will be able to identify new sign ordering principles and possibly reach a more controlled set of signs that may represent anthroponyms. Such a list would offer a better (if still slim) chance at linguistic decipherment. Our data exploration toolkit provides an interface for fuzzy string matching to facilitate further investigation of strings like these.
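Pairs of strings that differ by only a single sign can be found without comparing every pair directly. The sketch below is one simple way to do so and is not the toolkit's fuzzy matching interface; it assumes n-grams are tuples of sign names, and `one_substitution_pairs` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations

def one_substitution_pairs(ngrams):
    """Pairs of equal-length sign strings that differ in exactly one position."""
    buckets = defaultdict(set)
    for gram in set(ngrams):
        for i in range(len(gram)):
            # Mask position i: strings that differ only there share this key.
            buckets[(len(gram), i, gram[:i] + gram[i + 1:])].add(gram)
    pairs = set()
    for group in buckets.values():
        pairs.update(combinations(sorted(group), 2))
    return sorted(pairs)
```

This only detects substitutions. Insertions or deletions of a sign would need a more general edit distance.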

LDA Topic Model
Latent Dirichlet Allocation (LDA; Blei et al. 2003) is a topic modeling algorithm which attempts to group related words into topics and determine which topics are discussed in a given set of documents. Notably, LDA infers topical relationships solely based on rates of term co-occurrence, meaning it can run on undeciphered texts to yield information on which terms may be related. Note, however, that topics may be semantically broad, and one must be careful not to infer too much about a sign's meaning simply from its appearance in a given topic. LDA differs from the other clustering techniques we have considered in that it also provides a means for grouping tablets based on the topics they discuss, which may reveal genres or other meaningful divisions of the corpus.
We induced a 10-topic LDA model over the PE corpus. We chose a small number of topics to make the task of interpreting the model more manageable; fewer topics make for fewer sets of representative signs to analyze. Furthermore, with 10 topics the model learns topics which are mostly non-overlapping (Figure 6), meaning there are few redundant topics to sort through. We note, however, that model perplexity drops sharply above 80 topics, and topic coherence peaks around 110 topics; future work may therefore do well to investigate larger models. The following sections begin to elaborate on the topics which we can most easily interpret, although space constraints prohibit full analysis of each individual topic. Our data exploration toolkit provides additional details including information about topic stability using the stability measure introduced by Mäntylä et al. (2018).
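For readers unfamiliar with how such a model is fitted, the following is a minimal collapsed Gibbs sampler for LDA over texts given as lists of sign names. It is a generic textbook sketch, not the implementation or the hyperparameters used for the model discussed here; the function name `lda_gibbs` and the default values of `alpha`, `beta`, `iterations`, and `seed` are assumptions.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over documents given as lists of signs."""
    rng = random.Random(seed)
    vocab_size = len({sign for doc in docs for sign in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per topic in each text
    topic_sign = [Counter() for _ in range(num_topics)]  # sign counts per topic
    topic_total = [0] * num_topics                       # tokens per topic overall
    assignments = []
    for d, doc in enumerate(docs):                       # random initial topics
        z = [rng.randrange(num_topics) for _ in doc]
        assignments.append(z)
        for sign, k in zip(doc, z):
            doc_topic[d][k] += 1
            topic_sign[k][sign] += 1
            topic_total[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, sign in enumerate(doc):
                k = assignments[d][i]                    # remove the current assignment
                doc_topic[d][k] -= 1
                topic_sign[k][sign] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_sign[t][sign] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k                    # record the resampled topic
                doc_topic[d][k] += 1
                topic_sign[k][sign] += 1
                topic_total[k] += 1
    return assignments, doc_topic, topic_sign
```

The dominant topic of text `d` can then be read off as the index of the largest value in `doc_topic[d]`, and the most representative signs of topic `k` as `topic_sign[k].most_common()`. A maintained library implementation would be preferable for a corpus of realistic size.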

Topic 1
The most representative signs for this topic are M376 and M056∼f. M376 has been speculated to represent either a human worker category or cattle; M056∼f is a depiction of a plow (comparable to the proto-cuneiform sign for plow, APIN). This is an intriguing connection as a sign-set for bovines has not yet been identified in PE, despite the clear cultural importance of cattle suggested by PE cylinder seal depictions (Dahl, 2016). More interesting still is the fact that M376 and M056∼f never occur in the same text. Their inclusion in the same topic implies that they simply occur in the presence of similar signs (though not as direct neighbors of those signs, since they do not group together in the neighbor-based clustering). Topic modelling in this case has brought to light tendencies in the writing system that may have been intuitively grasped but would be difficult to quantify manually.

Topic 3
The signs M297∼b and M297 are both highly representative of this topic. This is interesting as the relationship between these two signs has been uncertain (Meriggi, 1971:74). M297∼b was hypothesised to indicate a "keg" by Friberg (1978). It is an "object" sign that almost always appears in the ultimate or penultimate position of sign strings; it sometimes appears in the summary line of accounts followed by numerical notations that quantify amounts of grain or liquids. Friberg suspected such texts referred to ale distributions. Ale is thought to have been a staple of the PE diet at Susa. Meriggi suggested M297 may indicate "bread", but he also included it in his syllabary; it is the 6th most common sign in PE, appearing in 145 texts, and M297∼b is the 31st most common appearing in 66 texts. Yet topic 3 is the dominant topic in only 85 texts, suggesting that the LDA model has identified a particular subset of the accounts that refer to M297 or M297∼b. Also of note is the fact that M297∼b occurs in topic 3 at a significantly higher rate than M297, despite being rarer in general: a much higher percentage of the overall uses of M297∼b appear in this topic (around 75%) than do the overall uses of M297 (less than 15%).
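A comparison of this kind can be computed from per-token topic assignments. The sketch below assumes such assignments are available as lists parallel to the texts; whether the percentages above were derived in exactly this way is not stated, so this is an illustration only, and `topic_shares` is a hypothetical name.

```python
from collections import Counter, defaultdict

def topic_shares(docs, assignments):
    """For each sign, the fraction of its tokens assigned to each topic."""
    counts = defaultdict(Counter)
    for doc, topics in zip(docs, assignments):
        for sign, topic in zip(doc, topics):
            counts[sign][topic] += 1
    return {
        sign: {topic: n / sum(by_topic.values()) for topic, n in by_topic.items()}
        for sign, by_topic in counts.items()
    }
```

A rarer sign can have a much larger share in a topic than a more frequent one, because each share is normalized by that sign's own total number of occurrences.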

Topics 4 and 7
The texts included in topics 4 and 7 successfully reproduce aspects of Dahl 2005 with reference to the genres of PE livestock husbandry and slaughter texts. Dahl was able to decipher the ideographic meaning (if not phonetic realization) of signs for female, male, young, and mature sheep and goats and some of their products, beginning with the key observation that proto-cuneiform UDU ("mixed sheep and goats") is graphically comparable to M346. The most representative signs in topic 4 are M346 ("ewe") and M367 ("billy-goat").
While almost every instance of M346 is representative of topic 4, it is assigned to topic 5 in the atypical text P272825 (see 5.4). Several other typical livestock context uses of M346 belong to topic 7. Topic 7 was the most stable topic across 30 repeated runs in our topic stability evaluation. The most predictive sign for this topic is M009, which is also representative of topic 4 (and appeared in Section 4.1.1). The most representative texts in this topic include a few nanny-goat herding texts; many more texts in this topic have no known association with livestock or animal products, though a few (e.g. P009141 and P008407) do bear seal impressions depicting livestock.

Topic 5
The reason that the LDA model groups these 144 texts is not immediately apparent to the traditional PE specialist. An odd feature of the topic is that M388 ("person/man") is considered the most representative sign, but the most representative text is a simple tally of equids that never uses M388, and in fact uses few non-numerical signs overall. This may be due simply to noise in the model: M388 may be a kind of "stopword" which crops up in unrelated topics due to its high frequency. That said, an intriguing feature is that a significantly larger proportion of the texts in this topic bear a seal impression than do texts in the other topics. Seal impressions are unknown to the LDA model, and their presence suggests that it is at least possible the model has identified similarities in tablet content not easily observed through traditional analysis. The atypical "elite redistributive account" (Kelley, 2018:163) P272825, which is also sealed, is associated with this topic. This text has around 116 entries using complex sign-strings, fifteen of which include M388.

Topic 6
The ten most representative signs for topic 6 include the five of Meriggi's possible syllabic signs that grouped most stably in our clustering evaluation (see 4.1.1). Nine of the ten are also included in Meriggi's syllabary, excluding only M388, the second most representative sign in the topic. M388 has been key to the identification of possible PNs, since it tends to appear just before longer sign strings and, through a series of arguments drawing on cuneiform parallels, may function as a Personenkeil (a marker for human names; Damerow and Englund 1989; Kelley 2018:222 ff.). The texts of topic 6 are of diverse size and structure, but do tend to include many traditionally identifiable PNs.

Topic 10
This topic also confirms existing understanding of a PE administrative genre, namely that of "labor administration" (Damerow and Englund, 1989; Nissen et al., 1994). The most representative signs are the characteristic "worker category signs" described in the very long ration texts discussed by Dahl et al. (2018:24-23), and indeed all of those texts appear in this topic, in addition to a variety of other identifiable labor texts of somewhat different (but partially overlapping) content.

Remaining Topics (2, 8, and 9)
Initial assessments also suggest promising avenues of analysis for topics 2, 8, and 9. Topic 2 is heavily skewed towards M288 ("grain container"), the most common PE sign; its third most representative sign (M391, possibly meaning "field") may suggest an agricultural management context for some texts in this topic. Topic 8 is strongly represented by |M195+M057|. This is an undeciphered complex grapheme, frequently occurring as a text's second sign after the "header" M157. In topic 9, the two most representative signs are M387 and M036 (possibly associated with rationing). Since the LDA model is not aware of the numeric notation between entries, it is interesting that the bisexagesimal numeric systems B# and B appear prominently in this topic, whether or not M036 (associated with those systems) appears: see particularly P009048 (the text most strongly associated with this topic) and P008619.

LDA Summary
The preceding sections confirm that the LDA model largely learns topics which traditional PE specialists recognise as meaningful. Our brief interpretations of the topics serve only to highlight the amount of potentially fruitful analysis that still remains to be done. It also remains to see what topics arise when sign variants are collapsed together: preliminary results suggest that topics resembling our topic 6 and topic 10 are still found, but new topics also appear which have no clear correlates in the model discussed in this paper.

Related Work
Meriggi (1971:173-174) conducted manual graphotactic analysis of PE (and later linear Elamite) texts, for example by noting the positions in which certain signs could appear in sign-strings. Dahl (2002) was the first to use basic computer-assisted data sorting to present information on sign frequencies, and Englund (2004:129-138) concluded his discussion of "the state of decipherment" by suggesting that the newly transliterated corpus would benefit from more intensive study of sign ordering phenomena. Apart from the use of RapidMiner (https://www.rapidminer.com/) to perform simple data sorting in Kelley 2018, no publications have yet described any effort to apply computational approaches to the dataset.
Computational approaches to decipherment (Knight and Yamada, 1999; Knight et al., 2006), which resemble the setup typically followed by human archaeological-decipherment experts (Robinson, 2009), have been useful in several real-world tasks. Snyder et al. (2010) propose an automatic decipherment technique that further improves existing methods by incorporating cognate identification and lexicon induction. When applied to Ugaritic, the model is able to correctly map 29 of 30 letters to their Hebrew counterparts. Reddy and Knight (2011) study the Voynich manuscript for its linguistic properties, and show that the letter sequences are generally more predictable than in natural languages. Following this, Hauer and Kondrak (2016) treat the text in the Voynich manuscript as anagrammed substitution ciphers, and their experiments suggest, arguably, that Hebrew is the language of the document. Hierarchical clustering has previously been used by Knight et al. (2011) to aid in the decipherment of the Copiale cipher, where it was able to identify meaningful groups such as word boundary markers as well as signs which correspond to the same plaintext symbol. Homburg and Chiarcos (2016) report preliminary results on automatic word segmentation for Akkadian cuneiform using rule-based, dictionary-based, and data-driven statistical techniques. Pagé-Perron et al. (2017) furnish an analysis of Sumerian text including morphology, part-of-speech (POS) tagging, syntactic parsing, and machine translation using a parallel corpus. Although Sumerian and Akkadian are both geographically and chronologically close to PE, these corpora are very large (e.g. 1.5 million lines for Sumerian), and are presented in word-level transliterations rather than sign-by-sign transcriptions. This makes most of these techniques inapplicable to PE. Our study is more similar in spirit to Reddy and Knight (2011), as the Voynich manuscript and PE are both undeciphered and resource-poor, making analysis especially difficult.

Conclusions
We have shown that methods from computational linguistics can offer valuable insights into the proto-Elamite script, and can substantially improve the toolkit available to the PE specialist. Hierarchical sign clustering replicates previous work by rediscovering meaningful groups of signs, and suggests avenues for future work by revealing similarities between yet-undeciphered signs. Analysis of n-gram frequencies highlights the level of repetition of sign strings across the corpus as a point of further research interest, and also reveals sets of similar strings worth examining in detail. LDA topic modelling has replicated previous work in identifying known text genres, but has also suggested new relationships between tablets which can be explored using more traditional analysis. The methods we have used are by no means exhaustive, and there remain many more approaches to consider in future work. Particularly in a field populated by a small handful of researchers, the faster data processing and ease of visualization offered by computational methods may significantly aid progress towards understanding this writing system. We hope that our data exploration tools will help facilitate future discoveries, which may eventually lead to a more complete decipherment of the largest undeciphered corpus from the ancient world.

Figure 2: Detail of the (a) neighbor-based, (b) HMM, and (c) Brown clusterings showing signs possibly used in anthroponyms. M057, M066, M096, M218, and M371 are considered a stable cluster due to their proximity in all three clusterings.

Figure 3: M003∼b clusters identically with M263∼a in all three techniques.

Figure 4: The 10 most frequent PE unigrams, bigrams, and trigrams (top to bottom). In parentheses are given the frequencies of the two unigrams comprising each bigram, and the two bigrams comprising each trigram: note that some frequent n-grams are comprised of relatively infrequent (n − 1)-grams.

Figure 5: The 10 most frequent PE bigrams and trigrams (top to bottom), limited to signs in Dahl's (2019) syllabary. In parentheses are given the frequencies of the two unigrams comprising each bigram, and the two bigrams comprising each trigram.