Decipherment of Substitution Ciphers with Neural Language Models

Decipherment of homophonic substitution ciphers using language models is a well-studied task in NLP. Previous work on this task scores short local spans of candidate plaintext using n-gram language models; the most widely used technique is beam search with n-gram language models, proposed by Nuhn et al. (2013). We propose a beam search algorithm that scores the entire candidate plaintext at each step of the decipherment using a neural language model. We augment beam search with a novel rest cost estimation that exploits the predictive power of a neural language model. We compare against state-of-the-art n-gram based methods on several different decipherment tasks. On challenging ciphers such as the Beale cipher we obtain significantly better error rates with much smaller beam sizes.


Introduction
Breaking substitution ciphers means recovering the plaintext from a ciphertext produced with a 1:1 or homophonic cipher key. Previous work using pretrained language models (LMs) for decipherment relies on n-gram LMs (Ravi and Knight, 2011; Nuhn et al., 2013). Some methods use the Expectation-Maximization (EM) algorithm (Knight et al., 2006), while most state-of-the-art approaches for decipherment of 1:1 and homophonic substitution ciphers use beam search and rely on the clever use of n-gram LMs (Nuhn et al., 2014; Hauer et al., 2014). Neural LMs, by contrast, globally score the entire candidate plaintext sequence (Mikolov et al., 2010). However, using a neural LM for decipherment is not trivial, because scoring every partially deciphered candidate plaintext in full is computationally challenging. We address this challenge in this paper and provide an improved beam search based decipherment algorithm for homophonic ciphers that exploits pre-trained neural LMs for the first time.

Decipherment Model
We use the notation from Nuhn et al. (2013). The ciphertext f_1^N = f_1 ... f_i ... f_N and plaintext e_1^N = e_1 ... e_i ... e_N draw their symbols from vocabularies f_i ∈ V_f and e_i ∈ V_e respectively. The beginning tokens of the ciphertext (f_0) and plaintext (e_0) are set to "$", denoting the beginning of a sentence. Substitutions are represented by a function φ : V_f → V_e such that 1:1 substitutions are bijective while homophonic substitutions are general. A cipher function φ in which not every φ(f) is fixed is called a partial cipher function (Corlett and Penn, 2010). The number of f's that are fixed in φ is given by its cardinality. φ' is called an extension of φ if every f already fixed in φ remains fixed to the same value in φ', i.e. δ(φ'(f), φ(f)) yields true for all f ∈ V_f already fixed in φ, where δ is the Kronecker delta. Decipherment is then the task of finding the φ for which the probability of the deciphered text is maximized:

φ̂ = argmax_φ p(φ(f_1^N))

where p(·) is the language model (LM). This argmax is computed with a beam search algorithm (Nuhn et al., 2013) which incrementally finds the most likely substitutions, using the language model scores as the ranking.
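A minimal Python sketch (ours, not the authors' implementation) of the formalism above: a partial cipher function is a mapping from fixed cipher symbols to plaintext symbols, its cardinality is the number of fixed symbols, and an extension must agree with it on every symbol already fixed.

```python
# Sketch: a partial cipher function phi as a dict from cipher symbols (V_f)
# to plaintext symbols (V_e). Symbols absent from the dict are not yet fixed.

def cardinality(phi):
    """Number of cipher symbols already fixed in phi."""
    return len(phi)

def is_extension(phi_ext, phi):
    """phi_ext extends phi iff it agrees with phi on every fixed symbol
    (the Kronecker-delta condition in the text)."""
    return all(phi_ext.get(f) == e for f, e in phi.items())

phi = {"A": "a", "B": "b"}                  # partial: only A and B fixed
phi_ext = {"A": "a", "B": "b", "C": "c"}    # fixes one more symbol
assert cardinality(phi) == 2
assert is_extension(phi_ext, phi)
assert not is_extension({"A": "x", "B": "b"}, phi)  # disagrees on A
```

The beam search described below enumerates such extensions in a fixed symbol order, keeping only the highest-scoring ones.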

Neural Language Model
The advantage of a neural LM is that it can be used to score the entire candidate plaintext for a hypothesized partial decipherment. In this work, we use a state-of-the-art byte (character) level neural LM using a multiplicative LSTM (Radford et al., 2017). Consider a sequence S = w_1, w_2, ..., w_N. The LM score of S, SCORE(S), is derived from

P(S) = P(w_1, w_2, ..., w_N) = ∏_{i=1}^{N} P(w_i | w_1, ..., w_{i-1})
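The chain-rule scoring above can be sketched in a few lines of Python. Here `cond_logp` is a toy stand-in for the mLSTM's conditional distribution (a uniform model over a hypothetical 27-symbol alphabet), not the actual network; the point is only that the score of a sequence is the sum of per-character conditional log-probabilities.

```python
import math

# Sketch (ours): score a candidate plaintext as the sum of conditional
# character log-probabilities. `cond_logp` is a placeholder for the
# neural LM of Radford et al. (2017); here it is a uniform toy model.

def cond_logp(history, char):
    return math.log(1.0 / 27.0)  # the NLM would return log P(char | history)

def score(seq):
    return sum(cond_logp(seq[:i], seq[i]) for i in range(len(seq)))
```

With a real neural LM, `cond_logp` would run the network over `history` and read off the log-probability of `char` from its output distribution.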

Beam Search
Algorithm 1 is the beam search algorithm (Nuhn et al., 2013, 2014) for solving substitution ciphers. It monitors all partial hypotheses in lists H_s and H_t based on their quality. As the search progresses, the partial hypotheses are extended, scored with SCORE and appended to H_t. EXT_LIMITS determines which extensions should be allowed and EXT_ORDER picks the next cipher symbol for extension. The search continues after pruning: H_s ← HISTOGRAM_PRUNE(H_t). We augment this algorithm by updating the SCORE function with a neural LM.
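A simplified Python sketch of this loop (ours, not the paper's Algorithm 1 verbatim): each hypothesis is a partial cipher function; the next symbol in the extension order is fixed to every allowed plaintext letter, candidates are scored, and histogram pruning keeps the top `beam_size`. The toy `score_fn` in the usage example simply counts agreements with a known target key, standing in for the LM-based SCORE.

```python
# Sketch of the beam search of Algorithm 1 (simplified).

def beam_search(ext_order, plain_vocab, score_fn, beam_size, one_to_one=True):
    beam = [dict()]  # H_s starts with the empty cipher function
    for f in ext_order:
        candidates = []  # H_t
        for phi in beam:
            for e in plain_vocab:
                if one_to_one and e in phi.values():
                    continue  # EXT_LIMITS: keep 1:1 keys bijective
                ext = dict(phi)
                ext[f] = e
                candidates.append(ext)
        # HISTOGRAM_PRUNE: keep only the best beam_size extensions
        beam = sorted(candidates, key=score_fn, reverse=True)[:beam_size]
    return beam[0]

# Toy usage: a score that counts matches against a known key recovers it.
target = {"A": "a", "B": "b", "C": "c", "D": "d"}
score_fn = lambda phi: sum(1 for f, e in phi.items() if target[f] == e)
assert beam_search(["B", "A", "C", "D"], "abcd", score_fn, 4) == target
```

In the real algorithm `score_fn` is the (neural or n-gram) LM score of the partial decipherment, and `ext_order` is chosen by EXT_ORDER rather than given.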

Score Estimation (SCORE)
Score estimation evaluates the quality of the partial hypotheses φ. Using the example from Nuhn et al. (2014), consider the vocabularies V_e = {a, b, c, d} and V_f = {A, B, C, D}, extension order (B, A, C, D), and ciphertext $ ABDDCABCDADCABDC $. Let φ = {(a, A), (b, B)} be the partial hypothesis. Then SCORE(φ) scores this hypothesized partial decipherment (only A and B are converted to plaintext) using a pre-trained language model in the hypothesized plaintext language.
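Applying the partial hypothesis φ = {(a, A), (b, B)} to the example ciphertext can be sketched as below (our illustration); unfixed cipher symbols are shown as '.' placeholders, and '$' and spaces pass through unchanged.

```python
# Sketch: apply a partial cipher function to a ciphertext, leaving
# unsolved symbols as '.' so the partial decipherment is visible.

def apply_phi(ciphertext, phi, unknown="."):
    return "".join(phi.get(c, c if c in "$ " else unknown) for c in ciphertext)

partial = apply_phi("$ ABDDCABCDADCABDC $", {"A": "a", "B": "b"})
assert partial == "$ ab...ab..a..ab.. $"
```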

Baseline
The initial rest cost estimator introduced by Nuhn et al. (2013) computes the score of a hypothesis based only on the partially deciphered text that forms a shard of n adjacent solved symbols. As a heuristic, n-grams which still contain unsolved cipher symbols are assigned a trivial probability estimate of 1. An improved version of rest cost estimation (Nuhn et al., 2014) consults lower-order n-grams to score each position.

Global Rest Cost Estimation
The baseline scoring method relies heavily on local context, i.e., the estimation is strictly based on partial character sequences. Since it depends solely on the n-gram LM, the true conditional probability is modeled only under a Markov assumption, and context dependency beyond a window of (n − 1) characters is ignored. As a result, attempting to use more context can lower the probability of some tokens, resulting in poor scores.
We address this issue with a new, improved rest cost estimator that supplements the partial decipherment φ(f_1^N) with plaintext symbols predicted by our neural language model (NLM). Applying φ = {(a, A), (b, B)} to the ciphertext above, we get the partial hypothesis $ ab...ab..a..ab.. $, where '.' marks an unsolved position. We introduce a scoring function that is able to score the entire plaintext including the missing plaintext symbols. First, we sample the plaintext symbols from the NLM at all unsolved locations, conditioned on the deciphered tokens from the partial hypothesis φ, so that the deciphered tokens maintain their respective positions in the sequence while the sampled symbols fit (probabilistically) in this context. Next, we determine the probability of the entire sequence, including the scores of the sampled plaintext, as our rest cost estimate.
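The fill-in step can be sketched as follows (our toy illustration, not the authors' code). `sample_next` is a deterministic placeholder for drawing a character from the NLM conditioned on everything generated so far; the essential property shown is that deciphered symbols keep their positions while unsolved positions are filled by the model.

```python
# Sketch of the global rest cost fill-in: walk left to right; fixed
# positions keep their deciphered symbol, each unknown position is filled
# with a symbol sampled from the LM given the prefix built so far.

def sample_next(history):
    return "e"  # placeholder: the NLM would sample from P(. | history)

def fill_and_keep_positions(partial, unknown="."):
    out = []
    for ch in partial:
        out.append(sample_next("".join(out)) if ch == unknown else ch)
    return "".join(out)

# The partial decipherment of the running example under phi = {(a,A),(b,B)}:
filled = fill_and_keep_positions("$ ab...ab..a..ab.. $")
assert "." not in filled
assert all(a == b for a, b in zip(filled, "$ ab...ab..a..ab.. $") if b != ".")
```

The rest cost estimate is then the LM score of `filled`, i.e. of the full sequence rather than only the solved shards.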

In our running example, this yields a score estimation of the partial decipherment φ(f_1^N) in which each unsolved position is filled with a plaintext symbol sampled from the NLM. Since more terms participate in the rest cost estimation with global context, the plaintext LM provides us with a better rest cost in the beam search.

Frequency Matching Heuristic
Alignment by frequency similarity (Yarowsky and Wicentowski, 2000) assumes that two forms belong to the same lemma when their relative frequencies fit the expected distribution. We use this heuristic to augment the score estimation (SCORE):

FMH(φ) = |ν(f) − ν(e)| (3)

where ν(f) is the percentage relative frequency of the ciphertext symbol f, while ν(e) is the percentage relative frequency of the plaintext token e in the plaintext language model. The closer this value is to 0, the more likely it is that f is mapped to e.
Thus, given a φ with score SCORE(φ), the extension φ' (Algo. 1) is scored as:

SCORE(φ') = SCORE(φ) + NEW(φ') (4)

where NEW is the score for the symbols that have been newly fixed in φ' while extending φ to φ'. Our experimental evaluations show that the global rest cost estimator and the frequency matching heuristic both contribute positively to accuracy on different ciphertexts.
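The frequency matching heuristic above can be sketched as follows (our illustration): compute percentage relative frequencies on both sides, and compare a cipher symbol's frequency with that of its hypothesized plaintext letter.

```python
from collections import Counter

# Sketch: percentage relative frequencies and the frequency match value.
# The closer fmh(f, e) is to 0, the more plausible the mapping f -> e.

def rel_freq(text):
    counts = Counter(text)
    total = sum(counts.values())
    return {c: 100.0 * n / total for c, n in counts.items()}

def fmh(f, e, nu_cipher, nu_plain):
    return abs(nu_cipher.get(f, 0.0) - nu_plain.get(e, 0.0))

nu_c = rel_freq("ABDDCABCDADCABDC")   # ciphertext frequencies
nu_p = rel_freq("abddcabcdadcabdc")   # toy stand-in for LM training text
assert fmh("A", "a", nu_c, nu_p) == 0.0   # identical distributions match
```

In the full system ν(e) comes from the neural LM's training corpus, not from a toy string.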

Experimental Evaluation
We carry out two sets of experiments: one on letter-based 1:1 substitution ciphers, and another on homophonic substitution ciphers. We report the Symbol Error Rate (SER), the fraction of characters in the deciphered text that are incorrect.
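For concreteness, the SER metric amounts to (our sketch):

```python
# Sketch: Symbol Error Rate = fraction of deciphered characters that
# differ from the reference plaintext at the same position.

def symbol_error_rate(hyp, ref):
    assert len(hyp) == len(ref)
    return sum(h != r for h, r in zip(hyp, ref)) / len(ref)

assert symbol_error_rate("abcd", "abcd") == 0.0
assert symbol_error_rate("abce", "abcd") == 0.25
```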
The character NLM uses a single-layer multiplicative LSTM (mLSTM) (Radford et al., 2017) with 4096 units. The model was trained for a single epoch on mini-batches of 128 subsequences of length 256, for a total of 1 million weight updates. States were initialized to zero at the beginning of each data shard and persisted across updates to simulate full backpropagation and allow the forward propagation of information beyond a given subsequence. In all experiments we use a character NLM trained on the English Gigaword corpus augmented with a short corpus of about 2,000 words of plaintext letters authored by the Zodiac killer.

1:1 Substitution Ciphers
In this experiment we use a synthetic 1:1 letter substitution cipher dataset, following Ravi and Knight (2008), Nuhn et al. (2013) and Hauer et al. (2014). The text comes from English Wikipedia articles about history, preprocessed by stripping all images and tables, lower-casing all characters, and removing all non-alphabetic and non-space characters. We create 50 cryptograms for each of the lengths 16, 32, 64, 128 and 256 using a random 1:1 substitution.
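The cryptogram construction can be sketched as below (our illustration of the setup, not the authors' script): draw a random bijective key over the lowercase alphabet and apply it to the preprocessed plaintext, leaving spaces intact.

```python
import random
import string

# Sketch: generate a random 1:1 substitution key and encipher a plaintext.

def random_key(rng):
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))  # bijective by construction

def encipher(plaintext, key):
    return "".join(key.get(c, c) for c in plaintext)

rng = random.Random(0)          # seeded for reproducibility
key = random_key(rng)
cipher = encipher("the quick brown fox", key)
inverse = {v: k for k, v in key.items()}
assert encipher(cipher, inverse) == "the quick brown fox"  # roundtrip
```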

An Easy Cipher: Zodiac-408
Zodiac-408, a homophonic cipher, is commonly used to evaluate decipherment algorithms. Our neural LM with global rest cost estimation and the frequency matching heuristic achieves an SER of 1.2% with a beam size of 1M, compared to an SER of 2% for the beam search algorithm (Nuhn et al., 2013) with a beam size of 10M and a 6-gram LM. The improved beam search (Nuhn et al., 2014) with an 8-gram LM, however, gets 52 out of 54 mappings correct on the Zodiac-408 cipher.

A Hard Cipher: Beale Pt 2
Part 2 of the Beale Cipher is a more challenging homophonic cipher because of its much larger search space of solutions. Nuhn et al. (2014) were the first to automatically decipher this Beale cipher.
With an error of 5% at a beam size of 1M, versus 5.4% for an 8-gram LM with a pruning size of 10M, our system outperforms the state of the art (Nuhn et al., 2014) on this task.

Related Work
Automatic decipherment of substitution ciphers started with dictionary attacks (Hart, 1994; Jakobsen, 1995; Olson, 2007). Ravi and Knight (2008) frame the decipherment problem as an integer linear programming (ILP) problem. Knight et al. (2006) use an HMM-based EM algorithm for solving a variety of decipherment problems. Ravi and Knight (2011) extend the HMM-based EM approach with a Bayesian approach, and report the first automatic decipherment of the Zodiac-408 cipher. Berg-Kirkpatrick and Klein (2013) show that a large number of random restarts can help the EM approach. Corlett and Penn (2010) present an efficient A* search algorithm to solve letter substitution ciphers. Nuhn et al. (2013) produce better results in faster time than ILP and EM-based decipherment methods by employing a higher-order language model and an iterative beam search algorithm. Nuhn et al. (2014) present various improvements to the beam search algorithm of Nuhn et al. (2013), including improved rest cost estimation and an optimized strategy for ordering decipherment of the cipher symbols. Hauer et al. (2014) propose a novel approach for solving mono-alphabetic substitution ciphers which combines character-level and word-level language models; they formulate decipherment as a tree search problem and use Monte Carlo Tree Search (MCTS) as an alternative to beam search. Their approach is the best for short ciphers. Greydanus (2017) frames the decryption process as a sequence-to-sequence translation task and uses a deep LSTM-based model to learn the decryption algorithms for three polyalphabetic ciphers, including the Enigma cipher; however, this approach requires supervision, whereas our approach uses a pre-trained neural LM. Gomez et al. (2018) (CipherGAN) use a generative adversarial network to learn the mapping between the learned letter embedding distributions of the ciphertext and plaintext, and apply this approach to shift ciphers (including Vigenère ciphers). Their approach cannot be extended to homophonic ciphers or to full-message neural LMs as in our work.

Conclusion
This paper presents, to our knowledge, the first application of large pre-trained neural LMs to the decipherment problem. We modify the beam search algorithm for decipherment from Nuhn et al. (2013, 2014) and extend it to use global scoring of the plaintext message with neural LMs. To enable full plaintext scoring, we use the neural LM to sample plaintext characters, which reduces the required beam size. For challenging ciphers such as Beale Pt 2 we obtain lower error rates with smaller beam sizes than the state of the art in decipherment for such ciphers.