Neural Decipherment via Minimum-Cost Flow: From Ugaritic to Linear B

In this paper we propose a novel neural approach for automatic decipherment of lost languages. To compensate for the lack of a strong supervision signal, our model design is informed by patterns in language change documented in historical linguistics. The model utilizes an expressive sequence-to-sequence model to capture character-level correspondences between cognates. To effectively train the model in an unsupervised manner, we innovate the training procedure by formalizing it as a minimum-cost flow problem. When applied to the decipherment of Ugaritic, we achieve a 5% absolute improvement over state-of-the-art results. We also report the first automatic results in deciphering Linear B, a syllabic language related to ancient Greek, where our model correctly translates 67.3% of cognates.


Introduction
Decipherment is an ultimate low-resource challenge for both humans and machines. The lack of parallel data and scarce quantities of ancient text complicate the adoption of neural methods that dominate modern machine translation. Even for human experts this translation scenario proved to be onerous: a typical decipherment spans over decades and requires encyclopedic domain knowledge, prohibitive manual effort and sheer luck (Robinson, 2002). Moreover, techniques applied for the decipherment of one lost language are rarely reusable for another language. As a result, every significant human decipherment is considered to be one of a kind, "the rarest category of achievement" (Pope, 1975).
Prior work has demonstrated the feasibility of automatic decipherment. Snyder et al. (2010) translated the ancient Semitic language Ugaritic into Hebrew. Since both languages are derived from the same proto-Semitic origin, the translation involved matching their alphabets at the character level and mapping cognates at the word level. The effectiveness of their approach stemmed from its ability to incorporate expansive linguistic knowledge, including expected morphological correspondences, the nature of alphabet-level alignment, etc. As with human decipherment, this approach is highly customized for a given language pair and does not generalize to other lost languages.
In this paper, we introduce a neural decipherment algorithm that delivers strong performances across several languages with distinct linguistic characteristics. As in prior work, our input consists of text in a lost language and a non-parallel corpus in a known related language. The model is evaluated on the accuracy of aligning words from the lost language to their counterparts in the known language.
To maintain the language-independent nature of the approach, we want to build the model around the most basic decipherment principles applicable across multiple languages. These principles are informed by known patterns in language change extensively documented in historical linguistics (Campbell, 2013). At the character level, we know that characters that originate from the same proto-language have similar distributional profiles with respect to their occurrences. Another important constraint at the character level is that cognate alignment is monotonic, since character reorderings within cognate pairs are rare. At the vocabulary level, we want to enforce a skewed mapping, assuming a roughly one-to-one correspondence between words. Finally, we want to ensure that the resulting vocabulary mapping covers a significant portion of the lost language vocabulary and can also account for the presence of words which are not cognates.
Our model captures both character-level and word-level constraints in a single generative framework wherein vocabulary-level alignment is a latent variable. We model the cognate generation process using a character-level sequence-to-sequence model which is guided towards monotonic rewriting via regularization. Distributional similarity at the character level is achieved via universal character embeddings. We enforce constraints on the vocabulary mapping via a minimum-cost flow formulation that controls structural sparsity and coverage of the global cognate assignment. The two components of the model, sequence-to-sequence character alignment and flow constraints, are trained jointly using an EM-style procedure.
We evaluate our algorithm on two lost languages, Ugaritic and Linear B. In the case of Ugaritic, we demonstrate improved performance of cognate identification, yielding a 5.5% absolute improvement over previously published results (Snyder et al., 2010). This is achieved without assuming access to the morphological information in the known language.
To demonstrate the applicability of our model to other linguistic families, we also consider the decipherment of Linear B, an ancient script dating back to 1450 BC. Linear B exhibits a number of significant differences from Ugaritic, most notably its syllabic writing system. It has not been previously deciphered by automatic means. We were able to correctly translate 67.3% of Linear B cognates into their Greek equivalents in the decipherment scenario. Finally, we demonstrate that the model achieves superior performance on cognate datasets used in previous work (Berg-Kirkpatrick and Klein, 2013).

Related Work
Decoding of Ciphered Texts Early work on decipherment was primarily focused on man-made ciphers, such as substitution ciphers. Most of these approaches are based on EM algorithms which are further adjusted for target decipherment scenarios. These adjustments are informed by assumptions about ciphers used to produce the data (Knight and Yamada, 1999; Knight et al., 2006; Ravi and Knight, 2011; Pourdamghani and Knight, 2017). Besides the commonly used EM algorithm, Nuhn et al. (2013), Hauer et al. (2014) and Kambhatla et al. (2018) also tackle substitution decipherment and formulate the problem as a heuristic search procedure, with guidance provided by an external language model (LM) for candidate rescoring. So far, techniques developed for man-made ciphers have not been shown to be successful in deciphering archaeological data. This can be attributed to the inherent complexity of the processes behind the evolution of related languages.
Nonparallel Machine Translation Advancements in distributed representations kindled exciting developments in this field, including translations at both the lexical and the sentence level. Lexical translation is primarily formulated as alignment of monolingual embedding spaces into a crosslingual representation using adversarial training (Conneau et al., 2017), VAE (Dou et al., 2018), CCA (Haghighi et al., 2008; Faruqui and Dyer, 2014) or mutual information (Mukherjee et al., 2018). The constructed monolingual embedding spaces are usually of high quality due to the large amount of monolingual data available. The improved quality of distributed representations has a similarly strong impact on non-parallel translation systems that operate at the sentence level (Pourdamghani and Knight, 2017). In that case, access to a powerful language model can partially compensate for the lack of explicit parallel supervision. Unfortunately, these methods cannot be applied to ancient texts due to the scarcity of available data.
Snyder et al. (2010) were the first to demonstrate the feasibility of automatic decipherment of a dead language using non-parallel data. The success of their approach can be attributed to a cleverly designed Bayesian model that structurally incorporated powerful linguistic constraints. These include customized priors for alphabet matching, incorporation of morphological structure, etc. Berg-Kirkpatrick and Klein (2011) proposed an alternative decipherment approach based on a relatively simple model paired with a sophisticated inference algorithm. While their model performed well in a noise-free scenario where the matching vocabularies contain only cognates, it has not been shown to be successful in a full decipherment scenario. Our approach outperforms these models in both scenarios. Moreover, we demonstrate that the same architecture deciphers two distinct ancient languages, Ugaritic and Linear B. The latter result is particularly important given that Linear B is a syllabic language.

Approach
The main challenge of the decipherment task is the lack of a strong supervision signal that guides standard machine translation algorithms. Therefore, the proposed architecture has to effectively utilize known patterns in language change to guide the decipherment process. These properties are summarized below:
1. Distributional Similarity of Matching Characters: Since matching characters appear in similar places in corresponding cognates, their contexts should match.
2. Monotonic Character Mapping within Cognates: Matching cognates rarely exhibit character reordering, therefore their alignment should be order preserving.
3. Structural Sparsity of Cognate Mapping: It is well-documented in historical linguistics that cognate matches are mostly one-to-one, since both words are derived from the same proto-origin.
4. Significant Cognate Overlap Within Related Languages: We expect that the derived vocabulary mapping will have sufficient coverage for lost language cognates.

Generative framework
We encapsulate these basic decipherment principles into a single generative framework. Specifically, we introduce a latent variable $F = \{f_{i,j}\}$ that represents the word-level alignment between the words in the lost language $X = \{x_i\}$ and those in the known language $Y = \{y_j\}$. More formally, we derive the joint probability
$$\Pr(X, Y) = \sum_{F \in \mathbb{F}} \Pr(F)\,\Pr(X \mid F)\,\Pr(Y \mid X, F) \propto \sum_{F \in \mathbb{F}} \Pr(Y \mid X, F) = \sum_{F \in \mathbb{F}} \prod_{y_j \in Y} \Pr(y_j \mid X, F) \tag{1}$$
by assuming a uniform prior on both $\Pr(F)$ and $\Pr(X \mid F)$, and that every $y_j \in Y$ is i.i.d. We use $\mathbb{F}$ to describe the set of valid values for the latent variable $F$, subject to the global constraints as stated in Properties 3 and 4. More specifically, we utilize a minimum-cost flow setup to enforce these properties. The probability distribution $\Pr(y_j \mid X, F)$ is further defined as
$$\Pr(y_j \mid X, F) = \sum_{x_i \in X} f_{i,j} \cdot \Pr_\theta(y_j \mid x_i),$$
where the conditional probability $\Pr_\theta(y_j \mid x_i)$ is modeled by a character-based neural network parameterized by $\theta$, which incorporates the character-level constraints as stated in Properties 1 and 2.
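As a rough illustration of the objective for a fixed flow, the sketch below evaluates the log-likelihood of the known-language vocabulary, assuming that Pr(y_j | X, F) is the flow-weighted mixture of the neural model's conditional probabilities. The function name `mixture_log_likelihood` and the choice to skip words with no incoming flow are assumptions made for this example only.

```python
import math


def mixture_log_likelihood(flow, cond_prob):
    """Log-likelihood of the known vocabulary for one fixed flow.

    flow[i][j]      -- f_{i,j}, the flow from lost word x_i to known word y_j
    cond_prob[i][j] -- Pr_theta(y_j | x_i) from the character-level model
    """
    n_lost = len(flow)
    n_known = len(flow[0])
    total = 0.0
    for j in range(n_known):
        # Pr(y_j | X, F) = sum_i f_{i,j} * Pr_theta(y_j | x_i)
        p_j = sum(flow[i][j] * cond_prob[i][j] for i in range(n_lost))
        # Assumption for this sketch: known words that receive no flow
        # (non-cognates) are skipped instead of contributing log(0).
        if p_j > 0.0:
            total += math.log(p_j)
    return total
```

With a one-to-one flow, each known word contributes the log-probability of being generated from its single matched lost word.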
Directly optimizing Equation (1) is infeasible since it contains a summation over all valid flows. To bypass this issue, we adopt an EM-style iterative training regime. Specifically, the training process involves two interleaving steps. First, given the value of the flow $F$, the neural model is trained to optimize the likelihood function $\prod_{y_j \in Y} \Pr(y_j \mid X, F)$. Next, the flow is updated by solving a minimum-cost flow problem given the trained neural model. A detailed discussion of the training process is presented in Section 4.
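The alternation between the two steps can be sketched as a small driver loop. This is only a skeleton: `fit_model` and `solve_flow` are hypothetical callables standing in for the neural training step and the flow solver, and `num_rounds` is an assumed stopping rule.

```python
def em_style_training(initial_flow, fit_model, solve_flow, num_rounds):
    """Alternate between model fitting and flow re-estimation.

    fit_model(flow)   -- hypothetical: trains the neural model given the flow
    solve_flow(model) -- hypothetical: solves the min-cost flow given the model
    """
    flow = initial_flow
    model = None
    for _ in range(num_rounds):
        # Step 1: fix the flow, train the neural model on it.
        model = fit_model(flow)
        # Step 2: fix the model, recompute the word-level flow.
        flow = solve_flow(model)
    return model, flow
```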
We now proceed to provide details on both the neural model and the minimum-cost flow setup.

Neural decipherment model
We use a character-based sequence-to-sequence (seq2seq) model to incorporate the local constraints (Figure 1). Specifically, we integrate Property 1 by using a shared universal character embedding space and a residual connection. Furthermore, the property of monotonic rewriting is realized by a regularization term based on edit distance. We detail each component in the following paragraphs.
Universal character embedding We directly require that character embeddings of the two languages reside in the same space. Specifically, we assume that any character embedding in a given language is a linear combination of universal embeddings. More formally, we use a universal embedding matrix $U \in M_{n_u \times d}$, a lost language character weight matrix $W_x \in M_{n_x \times n_u}$ and a known language character weight matrix $W_y \in M_{n_y \times n_u}$. We use $n_u$ to denote the size of the universal character inventory, and $n_x$, $n_y$ the number of unique characters in the lost and the known languages, respectively. Embedding matrices for both languages are computed by
$$E_x = W_x U, \qquad E_y = W_y U.$$
This formulation reflects the principle underlying crosslingual embeddings such as MUSE (Conneau et al., 2017). Along a similar line, previous work has demonstrated the effectiveness of using universal word embeddings in the context of low-resource neural machine translation (Gu et al., 2018).
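To make the shapes concrete, the following minimal sketch builds both embedding tables from one shared universal matrix, assuming each language-specific embedding is the product of its weight matrix with U. The helper `matmul` and the toy sizes are illustrative and not taken from the paper's implementation.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Toy sizes (assumptions): n_u = 3 universal characters, d = 2 dimensions.
U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# n_x = 2 lost-language characters, each a mixture of universal characters.
W_x = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
# n_y = 1 known-language character.
W_y = [[1.0, 0.0, 0.0]]

E_x = matmul(W_x, U)  # shape n_x x d
E_y = matmul(W_y, U)  # shape n_y x d
```

Because both tables are derived from the same U, characters from the two languages live in one space by construction.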
Residual connection Character alignment is mostly local in nature, but this fact is not reflected by how the next character is predicted by the model. Specifically, the prediction is made based on the context vector $\tilde{h}$, which is a nonlinear function of the hidden states of the encoder and the decoder. As a result, $\tilde{h}$ captures a much wider context due to the nature of a recurrent neural network.
To address this issue and directly improve the quality of character alignment, we add a residual connection from the encoder embedding layer to the decoder projection layer. Specifically, letting $\alpha$ be the predicted attention weights, we compute
$$c = \sum_i \alpha_i E_x(i), \qquad \hat{h} = c \oplus \tilde{h}, \tag{3}$$
where $E_x(i)$ is the encoder character embedding at position $i$, and $c$ is the weighted character embedding. $\hat{h}$ is subsequently used to predict the next character. A similar strategy has also been adopted by Nguyen and Chiang (2018) to refine the quality of lexical translations in NMT.
Figure 2: An example of alignment between a Linear B word and a Greek word. Markers denote correct and wrong alignment positions, respectively. The misalignment between E and ν incurs a deletion error; the misalignment between 1 and ζ incurs an insertion error.
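The residual path can be sketched as follows. The attention-weighted embedding c follows directly from the description above; how c is merged with the context vector is treated here as an element-wise sum, which is an assumption of this sketch and may differ from the operator used in the actual model. All names are illustrative.

```python
def weighted_embedding(alpha, encoder_embeddings):
    """c = sum_i alpha_i * E_x(i): attention-weighted input character embedding."""
    dim = len(encoder_embeddings[0])
    return [
        sum(a * emb[d] for a, emb in zip(alpha, encoder_embeddings))
        for d in range(dim)
    ]


def residual_output(alpha, encoder_embeddings, context):
    """Combine c with the context vector (assumed here: element-wise sum)."""
    c = weighted_embedding(alpha, encoder_embeddings)
    return [ci + hi for ci, hi in zip(c, context)]
```

When the attention is sharply peaked on one input position, c is close to that character's embedding, which gives the projection layer a direct, local signal.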
Monotonic alignment regularization We design a regularization term that guides the model towards monotonic rewriting. Specifically, we penalize the model whenever insertions or deletions occur. More concretely, for each word in the lost language $x_i$, we first compute the alignment probability $\Pr(a_i^t \mid x_i)$ over the input sequence at decoder time step $t$, predicted by the attention mechanism. Then we compute the expected alignment position as
$$p_i^t = \sum_k k \cdot \Pr(a_i^t = k \mid x_i),$$
where $k$ is any potential aligned position. The regularization term is subsequently defined as
$$\Omega_1(\{p_i^t\}) = \sum_t \left(p_i^t - p_i^{t-1} - 1\right)^2.$$
Note that no loss is incurred when the current alignment position immediately follows the previous position, namely $p_i^t = p_i^{t-1} + 1$. Furthermore, we use a quadratic loss function to discourage expensive multi-character insertions and deletions.
For Linear B, we modify this regularization term to accommodate the fact that it is a syllabic language and usually one Linear B script corresponds to two Greek letters. In particular, we use the following regularization term for Linear B:
$$\Omega_2(\{p_i^t\}) = \sum_t \left(p_i^t - p_i^{t-2} - 1\right)^2. \tag{5}$$
Figure 2 illustrates one alignment matrix from Linear B to Greek. In this example, the Linear B character E is supposed to be aligned with the Greek characters ν and ω but is only assigned to ω, hence incurring a deletion error; 1 is supposed to be aligned only to σ and o, but is assigned an extra alignment to ζ, incurring an insertion error.
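Both variants of the regularizer can be sketched with one function, assuming the penalty is the squared deviation of the expected alignment position from "previous position plus one", summed over decoder steps. The `lag` parameter is a device of this sketch: `lag=1` corresponds to the alphabetic case and `lag=2` to the Linear B case, where one input symbol is expected to cover two output letters. Positions are 0-indexed here, which is also an assumption.

```python
def expected_positions(attention):
    """attention[t][k] = Pr(a^t = k | x); returns the expected position per step."""
    return [sum(k * p for k, p in enumerate(row)) for row in attention]


def monotonic_penalty(attention, lag=1):
    """Sum over t of (p^t - p^{t-lag} - 1)^2.

    lag=1: the alignment should advance one input position per output step.
    lag=2: it should advance one input position every two output steps.
    """
    pos = expected_positions(attention)
    return sum((pos[t] - pos[t - lag] - 1.0) ** 2 for t in range(lag, len(pos)))
```

A perfectly diagonal attention matrix gives zero penalty with `lag=1`, while skipping an input position (a deletion) or staying on the same one (an insertion) is penalized quadratically.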

Minimum-cost flow
The latent variable $F$ captures the global constraints as stated in Properties 3 and 4. Specifically, $F$ should identify a reasonable number of cognate pairs between the two languages, while meeting the requirement that word-level alignments are one-to-one. To this end, we cast the task of identifying cognate pairs as a minimum-cost flow problem (Figure 3). More concretely, we have three sets of edges in the flow setup:
• $f_{s,i}$: edges from the source node to the word $x_i$ in the lost language,
• $f_{j,t}$: edges from the word $y_j$ in the known language to the sink node,
• $f_{i,j}$: edges from $x_i$ to $y_j$.
Each edge has a capacity of 1, effectively enforcing the one-to-one constraint. Only the edges $f_{i,j}$ have associated costs. We define this cost as the expected distance between $x_i$ and $y_j$:
$$\bar{d}_{i,j} = \mathbb{E}_{y \sim \Pr(y \mid x_i)}\, d(y, y_j),$$
where $d(\cdot, \cdot)$ is the edit distance function, and $\Pr(y \mid x_i)$ is given by the neural decipherment model. We use a sampling procedure proposed by Shen et al. (2016) to compute this expected distance. To provide a reasonable coverage of the cognate pairs, we further specify the demand constraint $\sum_j f_{j,t} = D$ with a given hyperparameter $D$.
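The flow setup above can be sketched without external dependencies. The experiments rely on an off-the-shelf solver; the routine below swaps in a plain successive-shortest-path algorithm with Bellman-Ford purely for illustration. The function name, node numbering and the dense cost matrix are assumptions of this example.

```python
def min_cost_flow_matching(cost, demand):
    """Send `demand` units through source -> x_i -> y_j -> sink at minimum cost.

    cost[i][j] is the cost on edge x_i -> y_j; every edge has capacity 1.
    Returns (flow, total_cost) with flow[i][j] in {0, 1}.
    """
    n, m = len(cost), len(cost[0])
    source, sink = n + m, n + m + 1
    num_nodes = n + m + 2
    edges = []  # each entry: [head, residual capacity, cost]; e ^ 1 is the reverse
    graph = [[] for _ in range(num_nodes)]

    def add_edge(u, v, cap, w):
        graph[u].append(len(edges))
        edges.append([v, cap, w])
        graph[v].append(len(edges))
        edges.append([u, 0, -w])

    for i in range(n):
        add_edge(source, i, 1, 0.0)          # f_{s,i}
        for j in range(m):
            add_edge(i, n + j, 1, cost[i][j])  # f_{i,j}
    for j in range(m):
        add_edge(n + j, sink, 1, 0.0)        # f_{j,t}

    total_cost = 0.0
    for _unit in range(demand):
        # Bellman-Ford shortest path in the residual graph.
        dist = [float("inf")] * num_nodes
        prev_edge = [-1] * num_nodes
        dist[source] = 0.0
        for _round in range(num_nodes - 1):
            updated = False
            for u in range(num_nodes):
                if dist[u] == float("inf"):
                    continue
                for e in graph[u]:
                    v, cap, w = edges[e]
                    if cap > 0 and dist[u] + w < dist[v] - 1e-12:
                        dist[v] = dist[u] + w
                        prev_edge[v] = e
                        updated = True
            if not updated:
                break
        if dist[sink] == float("inf"):
            raise ValueError("demand exceeds the maximum feasible flow")
        # Augment one unit of flow along the path found.
        v = sink
        while v != source:
            e = prev_edge[v]
            edges[e][1] -= 1
            edges[e ^ 1][1] += 1
            v = edges[e ^ 1][0]
        total_cost += dist[sink]

    flow = [[0] * m for _ in range(n)]
    for i in range(n):
        for e in graph[i]:
            v, cap, _w = edges[e]
            # Forward x_i -> y_j edges have even index; saturated means used.
            if e % 2 == 0 and n <= v < n + m and cap == 0:
                flow[i][v - n] = 1
    return flow, total_cost
```

With `demand` smaller than the vocabulary size, the solver is free to leave the most expensive words unmatched, which is how non-cognates are accommodated.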
We note that the edit distance cost plays an essential role of complementing the neural model. Specifically, neural seq2seq models are notoriously inadequate at capturing insertions and deletions, contributing to many issues of overgeneration or undergeneration in NMT (Tu et al., 2016). These problems are only accentuated due to a lack of supervision. Using edit distance in the flow setup helps alleviate this issue, since a misstep of insertion or deletion by the neural model will still generate a string that resembles the ground truth in terms of edit distance. In other words, the edit distance based flow can still recover from the mistakes the neural model makes.
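The edge cost discussed above can be approximated from a handful of model samples. The sketch below assumes that the neural model yields candidate strings with probabilities for a given lost word and renormalizes over that sampled subset; the function names and the input format are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def expected_distance(samples, target):
    """Expected edit distance to `target` under a sampled distribution.

    samples -- list of (candidate_string, probability) pairs for one x_i;
               probabilities are renormalized over the sampled subset
               (an assumption of this sketch).
    """
    total_prob = sum(p for _, p in samples)
    return sum(p * edit_distance(s, target) for s, p in samples) / total_prob
```

A candidate that differs from the true cognate by a single inserted or deleted character still receives a low cost, which is what allows the flow to recover from such model errors.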

Training
We note that with weak supervision, a powerful neural model can produce linguistically degenerate solutions. To prevent the neural model from getting stuck at an unreasonable local minimum, we make three modifications detailed in the following paragraphs. The entire training procedure is illustrated in Algorithm 1.
Flow decay The flow solver returns sparse values: the flow values for the edges are mostly zero. It is likely that this will discard many true cognate pairs, and the neural model trained on these sparse values can be easily misled and get stuck at some suboptimal local minimum.
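To alleviate this issue, we apply an exponential decay to the flow values, and compute an interpolation between the new flow result and the previous one. Specifically, we update the flow at iteration $\tau$ as
$$f_{i,j}^{(\tau)} = \gamma \cdot f_{i,j}^{(\tau-1)} + (1 - \gamma) \cdot \tilde{f}_{i,j}^{(\tau)},$$
where $\tilde{f}_{i,j}^{(\tau)}$ is the raw output given by the flow solver, and $\gamma$ is a hyperparameter.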
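A minimal sketch of this update, assuming that γ weights the previous flow and (1 − γ) weights the solver's raw output; the function name is illustrative.

```python
def decay_flow(prev_flow, raw_flow, gamma):
    """Interpolate between the previous flow and the solver's raw output.

    Each entry becomes gamma * previous + (1 - gamma) * raw, so earlier
    assignments fade out exponentially instead of being dropped at once.
    """
    return [
        [gamma * p + (1.0 - gamma) * r for p, r in zip(prev_row, raw_row)]
        for prev_row, raw_row in zip(prev_flow, raw_flow)
    ]
```

Under this reading, a large γ keeps more of the earlier soft assignments, and a small γ follows the solver's latest sparse output more closely.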
Norm control Recall that the residual connection combines a weighted character embedding $c$ and a context vector $\tilde{h}$ (Equation (3)). We observe that during training, $\tilde{h}$ has a much bigger norm than $c$, essentially defeating the purpose of improving character alignment by using a residual connection. To address this issue, we rescale $\tilde{h}$ so that the norm of $\tilde{h}$ does not exceed a certain percentage of the norm of $c$. More formally, given a ratio $r < 1.0$, we compute the residual output as
$$\hat{h} = c \oplus \left( r \cdot \frac{\lVert c \rVert_2}{\lVert \tilde{h} \rVert_2} \cdot \tilde{h} \right).$$
Periodic reset We re-initialize the parameters of the neural model and reset the state of the optimizer after each iteration. Empirically, we found that our neural network can easily converge to a suboptimal local minimum given a poor global word-level alignment. Resetting the model parameters periodically helps limit the negative effect caused by such alignments.
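The norm control step described above can be sketched as a simple rescaling. The version below always scales the context vector so that its L2 norm equals r times the norm of c; whether the rescaling is applied unconditionally or only when the bound is exceeded is an assumption of this sketch, as are the function names.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def rescale_context(c, h, r):
    """Scale the context vector h so that ||h'|| = r * ||c||, with r < 1."""
    scale = r * l2_norm(c) / l2_norm(h)
    return [scale * x for x in h]
```

With r = 0.2, the rescaled context vector contributes at most a fifth of the magnitude of the character embedding, so the local alignment signal is not drowned out.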

Experiments
Datasets We evaluate our system on the following datasets:
• UGARITIC: Decipherment from Ugaritic to Hebrew. Ugaritic is an ancient Semitic language closely related to Hebrew, which was used for the decipherment of Ugaritic. This dataset has been previously used for decipherment by Snyder et al. (2010).
• Linear B: Decipherment from Linear B to Greek. Linear B is a syllabic writing system used to write Mycenaean Greek, dating back to around 1450 BC. Decipherment of a syllabic language like Linear B is significantly harder, since it employs a much bigger inventory of symbols (70 in our corpus), and the symbols that have the same consonant or vowel look nothing alike.
We extracted pairs of Linear B scripts (i.e., words) and Greek pronunciations from a compiled list of Linear B lexicon. We process the data by removing some uncertain translations, eventually retaining 919 pairs in total. The Linear B scripts are kept as they are, and we remove all diacritics in the Greek data.
We also consider a subset of the Greek data to simulate an actual historical event where many Linear B syllabograms were deciphered by being compared with Greek location names. On the Greek side, we retain 455 proper nouns such as locations, names of gods or goddesses, and personal names. The entire vocabulary of the Linear B side is kept as it is. This results in a dataset with roughly 50% unpaired words on the Linear B side. We call this subset Linear B/names.
To the best of our knowledge, our experiment is the first attempt at deciphering Linear B automatically.
• ROMANCE: Cognate detection between three Romance languages. It contains phonetic transcriptions of cognates in Italian, Spanish and Portuguese. This dataset has been used by Hall and Klein (2010) and Berg-Kirkpatrick and Klein (2011).
Data statistics are summarized in Table 1.
Systems We report numbers for the following systems:
• Bayesian: the Bayesian model by Snyder et al. (2010) that automatically deciphered Ugaritic to Hebrew,
• Matcher: the system using combinatorial optimization, proposed by Berg-Kirkpatrick and Klein (2011).
We directly quote numbers from their papers for the UGARITIC and ROMANCE datasets. To facilitate direct comparison, we follow the same data processing procedure as documented in the literature.
Training details Our neural model uses a bidirectional LSTM as the encoder and a single-layer LSTM as the decoder. The dimensionality of character embeddings and the hidden size of the LSTM are set to 250 for all experiments. The size of the universal character inventory is 50 for all datasets except Linear B, for which we use 100. The hyperparameter for alignment regularization is set to 0.5, and the ratio $r$ to control the norm of the context vector is set to 0.2. We use Adam (Kingma and Ba, 2015) to optimize the neural model. To speed up the process of solving the minimum-cost flow problem, we sparsify the flow graph by only considering the top 5 candidates for every $x_i$. $\gamma = 0.9$ is used for the flow decay on all datasets except UGARITIC, for which we use $\gamma = 0.25$. We use the OR-Tools optimization toolkit (https://github.com/google/or-tools) as the flow solver. We found it beneficial to train our model only on a randomly selected subset (10%) of the entire corpus with the same percentage of noncognates, and test it on the full dataset. It is common for the dataset UGARITIC to contain several cognates for the same Ugaritic word, and we found that relaxing the capacity of $f_{j,t}$ to 3 yields a better result. For Linear B, similar to the finding of Berg-Kirkpatrick and Klein (2013), random restarting and choosing the best model based on the objective produces substantial improvements. In scenarios where many unpaired cognates are present, we follow Haghighi et al. (2008) and gradually increase the number of cognate pairs to identify.
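The sparsification of the flow graph mentioned in the training details can be sketched as a per-word candidate filter. Ranking candidates by lowest edge cost is an assumption of this example (the ranking criterion is not stated above), and the function name is illustrative.

```python
def top_k_candidates(cost_row, k=5):
    """Indices of the k known-language words with the lowest cost for one x_i.

    Only the edges x_i -> y_j for the returned j would be added to the
    flow graph; all other edges for this x_i are dropped.
    """
    ranked = sorted(range(len(cost_row)), key=lambda j: cost_row[j])
    return ranked[:k]
```

Keeping five outgoing edges per lost word reduces the number of cost-bearing edges from |X|·|Y| to 5·|X|.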

Results
UGARITIC We evaluate our system in two settings. First, we test the model under the noiseless condition where only cognate pairs are included during training. This is the setting adopted by Berg-Kirkpatrick and Klein (2011). Second, we conduct experiments in the more difficult and realistic scenario where there are unpaired words in both Ugaritic and Hebrew. This is the noisy setting considered by Snyder et al. (2010). As summarized in Table 2, our system outperforms existing methods by 3.1% under the noiseless condition, and 5.5% under the noisy condition.
We note that the significant improvement under the noisy condition is achieved without assuming access to any morphological information in Hebrew. In contrast, the previous system, Bayesian, utilized an inventory of known morphemes and complete morphological segmentations in Hebrew during training. The significant gains in identifying cognate pairs suggest that our proposed model provides a strong and viable approach towards automatic decipherment.

System        Noiseless   Noisy
Matcher       90.4        -
Bayesian      -           60.4
NeuroCipher   93.5        65.9
Table 2: Cognate identification accuracy (%) for UGARITIC under noiseless and noisy conditions. The noiseless baseline result is quoted from Berg-Kirkpatrick and Klein (2011), and the noisy baseline result is quoted from Snyder et al. (2010).
Linear B To illustrate the applicability of our system to other linguistic families, we evaluate the model on Linear B and Linear B/names. Table 3 shows that our system reaches a high accuracy of 84.7% on the noiseless Linear B corpus, and 67.3% on the more challenging and realistic Linear B/names dataset. We note that our system is able to achieve a reasonable level of accuracy with minimal change to the system. The only significant modification is the usage of a slightly different alignment regularization term (Equation (5)). We also note that neither of the previous systems, Bayesian and Matcher, is directly applicable to this language pair. The flexibility of the neural decipherment model is one of the major advantages of our approach.
ROMANCE Finally, we report results for ROMANCE (Hall and Klein, 2010) in Table 4, as further verification of the efficacy of our system. We include the average cognate detection accuracy across all language pairs as well as the accuracies for individual pairs. Note that in this experiment the dataset does not contain unpaired words. Table 4 shows that our system improves the overall accuracy by 1.5%, with most of the gain coming from the Es-It and It-Pt pairs.
Ablation study Finally, we investigate the contribution of various components of the model architecture to the decipherment performance. Specifically, we look at the design choices directly informed by patterns in language change (Table 5). In all three cases, the reduced decipherment model fails. Removing the monotonic alignment regularization or the residual connection drops accuracy to 0%, and removing the flow yields only 8.6%. This illustrates the utmost importance of injecting prior linguistic knowledge into the design of modeling and training for the success of decipherment.
System        UGARITIC
NeuroCipher   65.9
-monotonic    0.0
-residual     0.0
-flow         8.6
Table 5: Results for the noisy setting of UGARITIC. -monotonic and -residual remove the monotonic alignment regularization and the residual connection, respectively, and -flow does not use flow or iterative training.

Conclusions
We proposed a novel neural decipherment approach. We design the model and training procedure following fundamental principles of decipherment from historical linguistics, which effectively guide the decipherment process without a supervision signal. We use a neural sequence-to-sequence model to capture the character-level cognate generation process, for which the training procedure is formulated as a flow problem to impose vocabulary-level structural sparsity. We evaluate our approach on two lost languages from different linguistic families, Ugaritic and Linear B, and observe high accuracy in cognate identification. Our approach also demonstrates a significant improvement over existing work on Romance languages.