Phonetic and Visual Priors for Decipherment of Informal Romanization

Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages—namely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages: Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model’s performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments.


Introduction
Written online communication poses a number of challenges for natural language processing systems, including the presence of neologisms, code-switching, and the use of non-standard orthography. One notable example of orthographic variation in social media is informal romanization: speakers of languages written in non-Latin alphabets encode their messages in Latin characters, for convenience or due to technical constraints (improper rendering of native script or keyboard layout incompatibility). An example of such a sentence can be found in Figure 2. Unlike named entity transliteration, where the change of script represents the change of language, here Latin characters serve as an intermediate symbolic representation to be decoded by another speaker of the same source language. This calls for a completely different transliteration mechanism: instead of expressing the pronunciation of the word according to the phonetic rules of another language, informal transliteration can be viewed as a substitution cipher, where each source character is replaced with a similar Latin character.
1 The code and data are available at https://github.
Figure 1: Romanizations of the Russian word хорошо [horošo, 'good'] (middle) based on phonetic (top) and visual (bottom) similarity, with character alignments displayed. The phonetic-visual dichotomy gives rise to one-to-many mappings such as ш /ʃ/ → sh / w.
In this paper, we focus on decoding informally romanized texts back into their original scripts. We view the task as a decipherment problem and propose an unsupervised approach, which allows us to save annotation effort since parallel data for informal transliteration does not occur naturally. We propose a weighted finite-state transducer (WFST) cascade model that learns to decode informal romanization without parallel text, relying only on transliterated data and a language model over the original orthography. We test it on two languages, Egyptian Arabic and Russian, collecting our own dataset of romanized Russian from the Russian social network website vk.com.
Figure 2: Example of an informally romanized sentence from the dataset presented in this paper, containing a many-to-one mapping ж / ш → w. Scientific transliteration, broad phonetic transcription, and translation are not included in the dataset and are presented for illustration only.
Since informal transliteration is not standardized, converting romanized text back to its original orthography requires reasoning about the specific user's transliteration preferences and handling many-to-one (Figure 2) and one-to-many (Figure 1) character mappings, which is beyond traditional rule-based converters. Although user behaviors vary, there are two dominant patterns in informal romanization that have been observed independently across different languages, such as Russian (Paulsen, 2014), dialectal Arabic (Darwish, 2014) or Greek (Chalamandaris et al., 2006):
Phonetic similarity: Users represent source characters with Latin characters or digraphs associated with similar phonemes (e.g. м /m/ → m, л /l/ → l in Figure 2). This substitution method requires implicitly tying the Latin characters to a phonetic system of an intermediate language (typically, English).
Visual similarity: Users replace source characters with similar-looking symbols (e.g. ч /t͡ʃʲ/ → 4, у /u/ → y in Figure 2). Visual similarity choices often involve numerals, especially when the corresponding source language phoneme has no English equivalent (e.g. Arabic ع /ʕ/ → 3).
Taking that consistency across languages into account, we show that incorporating these style patterns into our model as priors on the emission parameters (also constructed from naturally occurring resources) improves the decoding accuracy on both languages. We compare the proposed unsupervised WFST model with a supervised WFST, an unsupervised neural architecture, and commercial systems for decoding romanized Russian (translit) and Arabic (Arabizi). Our unsupervised WFST outperforms the unsupervised neural baseline on both languages.

Related work
Prior work on informal transliteration uses supervised approaches with character substitution rules either manually defined or learned from automatically extracted character alignments (Darwish, 2014; Chalamandaris et al., 2004). Typically, such approaches are pipelined: they produce candidate transliterations and rerank them using modules encoding knowledge of the source language, such as morphological analyzers or word-level language models. Supervised finite-state approaches have also been explored (Wolf-Sonkin et al., 2019; Hellsten et al., 2017); these WFST cascade models are similar to the one we propose, but they encode a different set of assumptions about the transliteration process due to being designed for abugida scripts (using consonant-vowel syllables as units) rather than alphabets. To our knowledge, there is no prior unsupervised work on this problem.
Named entity transliteration, a task closely related to ours, is better explored, but there is little unsupervised work on this task as well. In particular, Ravi and Knight (2009) propose a fully unsupervised version of the WFST approach introduced by Knight and Graehl (1998), reframing the task as a decipherment problem and learning cross-lingual phoneme mappings from monolingual data. We take a similar path, although it should be noted that named entity transliteration methods cannot be straightforwardly adapted to our task due to the different nature of the transliteration choices. The goal of the standard transliteration task is to communicate the pronunciation of a sequence in the source language (SL) to a speaker of the target language (TL) by rendering it appropriately in the TL alphabet; in contrast, informal romanization emerges in communication between SL speakers only, and TL is not specified. If we picked any specific Latin-script language to represent TL (e.g. English, which is often used to ground phonetic substitutions), many of the informally romanized sequences would still not conform to its pronunciation rules: the transliteration process is character-level rather than phoneme-level and does not take possible TL digraphs into account (e.g. Russian сх /sx/ → sh), and it often involves eclectic visual substitution choices such as numerals or punctuation (e.g. Arabic تحت [tHt, 'under'] → ta7t, Russian для [dlja, 'for'] → dl9).
Finally, another relevant task is translating between closely related languages, possibly written in different scripts. An approach similar to ours is proposed by Pourdamghani and Knight (2017). They also take an unsupervised decipherment approach: the cipher model, parameterized as a WFST, is trained to encode the source language character sequences into the target language alphabet as part of a character-level noisy-channel model, and at decoding time it is composed with a word-level language model of the source language. Recently, unsupervised neural architectures (Lample et al., 2018, 2019) have also been used for related language translation and similar decipherment tasks (He et al., 2020), and we extend one of these neural models to our character-level setup to serve as a baseline (§5).

Methods
We train a character-based noisy-channel model that transforms a character sequence o in the native alphabet of the language into a sequence of Latin characters l, and use it to decode the romanized sequence l back into the original orthography. Our proposed model is composed of separate transition and emission components as discussed in §3.1, similarly to an HMM. However, an HMM assumes a one-to-one alignment between the characters of the observed and the latent sequences, which is not true for our task. One original script character can be aligned to two consecutive Latin characters or vice versa: for example, when a phoneme is represented with a single symbol on one side but with a digraph on the other (Figure 1), or when a character is omitted on one side but explicitly written on the other (e.g. short vowels not written in unvocalized Arabic but written in transliteration, or the Russian soft sign representing palatalization being often omitted in the romanized version). To handle those alignments, we introduce insertions and deletions into the emission model and modify the emission transducer to limit the number of consecutive insertions and deletions. In our experiments, we compare the performance of the model with and without informative phonetic and visual similarity priors described in §3.2.

Model
If we view the process of romanization as encoding a source sequence o into Latin characters, we can consider each observation l to have originated via o being generated from a distribution p(o) and then transformed to Latin script according to another distribution p(l|o). We can write the probability of the observed Latin sequence as:
$$p(l) = \sum_{o} p(o; \gamma) \cdot p(l \mid o; \theta) \cdot p_{\text{prior}}(\theta; \alpha) \quad (1)$$
The first two terms in (1) correspond to the probabilities under the transition model (the language model trained on the original orthography) and the emission model respectively. The third term represents the prior distribution on the emission model parameters through which we introduce human knowledge into the model. Our goal is to learn the parameters θ of the emission distribution with the transition parameters γ being fixed.
We parameterize the emission and transition distributions as weighted finite-state transducers (WFSTs):
Transition WFSA The n-gram weighted finite-state acceptor (WFSA) T represents a character-level n-gram language model of the language in the native script, producing the native alphabet character sequence o with the probability p(o; γ). We use the parameterization of Allauzen et al. (2003), with the states encoding conditioning history, arcs weighted by n-gram probabilities, and failure transitions representing backoffs. The role of T is to inform the model of what well-formed text in the original orthography looks like; its parameters γ are learned from a separate corpus and kept fixed during the rest of the training.
Emission WFST The emission WFST S transduces the original script sequence o to a Latin sequence l with the probability p(l|o; θ). Since there can be multiple paths through S that correspond to the input-output pair (o, l), this probability is summed over all such paths (i.e. is a marginal over all possible monotonic character alignments):
$$p(l \mid o; \theta) = \sum_{e \,:\, e \text{ maps } o \text{ to } l} p(e; \theta)$$
We view each path e as a sequence of edit operations: substitutions of original characters with Latin ones (c_o → c_l), insertions of Latin characters (ε → c_l), and deletions of original characters (c_o → ε). Each arc in S corresponds to one of the possible edit operations; an arc representing the edit c_o → c_l is characterized by the input label c_o, the output label c_l, and the weight −log p(c_l | c_o; θ). The emission parameters θ are the multinomial conditional probabilities of the edit operations p(c_l | c_o); we learn θ using the algorithm described in §3.3.
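The marginal over monotonic alignments described above can be illustrated with a small dynamic program. This is only a sketch and not the OpenFst-based implementation used in the paper: the dictionaries `sub`, `ins`, and `dele` hold hypothetical toy edit probabilities, and the transition model and delay limit are ignored.

```python
def emission_marginal(o, l, sub, ins, dele):
    """Sum the probabilities of all monotonic edit sequences turning o into l.

    sub[(c_o, c_l)], ins[c_l], and dele[c_o] are hypothetical toy edit
    probabilities; missing entries count as zero.
    """
    n, m = len(o), len(l)
    # alpha[i][j]: total probability of rewriting o[:i] as l[:j]
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # substitution c_o -> c_l
                alpha[i][j] += alpha[i - 1][j - 1] * sub.get((o[i - 1], l[j - 1]), 0.0)
            if j > 0:            # insertion eps -> c_l
                alpha[i][j] += alpha[i][j - 1] * ins.get(l[j - 1], 0.0)
            if i > 0:            # deletion c_o -> eps
                alpha[i][j] += alpha[i - 1][j] * dele.get(o[i - 1], 0.0)
    return alpha[n][m]


# Toy example: the single character "ш" romanized as the digraph "sh".
sub = {("ш", "s"): 0.5, ("ш", "h"): 0.1, ("ш", "w"): 0.4}
ins = {"s": 0.05, "h": 0.2}
marginal = emission_marginal("ш", "sh", sub, ins, {})
```

In this toy case two alignments contribute: substituting ш → s and then inserting h, or inserting s and then substituting ш → h.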

Phonetic and visual priors
To inform the model of which pairs of symbols are close in the phonetic or visual space, we introduce the priors on the emission parameters, increasing the probability of an original alphabet character being substituted by a similar Latin one. Rather than attempting to operationalize the notions of phonetic or visual similarity, we choose to read the likely mappings between symbols off human-compiled resources that use the same underlying principle: phonetic keyboard layouts and visually confusable symbol lists. Examples of mappings that we encode as priors can be found in Table 1.
Phonetic similarity Since we think of the informal romanization as a cipher, we aim to capture the phonetic similarity between characters based on association rather than on the actual grapheme-to-phoneme mappings in specific words. We approximate it using phonetic keyboard layouts, one-to-one mappings built to bring together "similar-sounding" characters in different alphabets. We take the character pairs from a union of multiple layouts for each language, two for Arabic and four for Russian. The main drawback of using keyboard layouts is that they require every character to have a Latin counterpart, so some mappings will inevitably be arbitrary; we compensate for this effect by averaging over several layouts.

Visual similarity
The strongest example of visual character similarity would be homoglyphs: symbols from different alphabets represented by the same glyph, such as Cyrillic а and Latin a. The fact that homoglyph pairs can be made indistinguishable in certain fonts has been exploited in phishing attacks, e.g. when Latin characters are replaced by virtually identical Cyrillic ones (Gabrilovich and Gontmakher, 2002). This led the Unicode Consortium to publish a list of symbols and symbol combinations similar enough to be potentially confusing to the human eye (referred to as confusables). This list contains not only exact homoglyphs but also strongly homoglyphic pairs such as Cyrillic Ю and Latin lO. We construct a visual prior for the Russian model from all Cyrillic-Latin symbol pairs in the Unicode confusables list. Although this list does not cover more complex visual associations used in informal romanization, such as partial similarity (Arabic Alif with Hamza → 2 due to Hamza resembling an inverted 2) or similarity conditioned on a transformation such as reflection (Russian л → v), it makes a sensible starting point. However, this restrictive definition of visual similarity does not allow us to create a visual prior for Arabic: the two scripts are dissimilar enough that the confusables list does not contain any Arabic-Latin character pairs. Proposing a more nuanced definition of visual similarity for Arabic and the associated prior is left for future work.
We incorporate these mappings into the model as Dirichlet priors on the emission parameters: θ ∼ Dir(α), where each dimension of the parameter α corresponds to a character pair (c_o, c_l), and the corresponding element of α is set to the number of times these symbols are mapped to each other in the predefined mapping set.
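As a rough illustration of how such counts could be collected, the sketch below tallies how many mapping sets pair each original character with each Latin character. The two layouts are hypothetical toy examples rather than the layouts used in the paper, and the handling of pairs that never occur in any mapping set is left out.

```python
from collections import Counter


def build_prior_counts(mapping_sets):
    """Count how many predefined mapping sets pair each (c_o, c_l)."""
    alpha = Counter()
    for mapping in mapping_sets:
        for c_o, c_l in mapping.items():
            alpha[(c_o, c_l)] += 1
    return alpha


# Hypothetical toy layouts, not the layouts used in the paper.
layout_a = {"ш": "w", "ч": "h", "м": "m"}
layout_b = {"ш": "w", "ч": "c", "м": "m"}
alpha = build_prior_counts([layout_a, layout_b])
```

Pairs on which the layouts agree receive a larger count, while an arbitrary assignment made by a single layout contributes only once.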

Learning
We learn the emission WFST parameters in an unsupervised fashion, observing only the Latin side of the training instances. The marginal likelihood of a romanized sequence l can be computed by summing over the weights of all paths through a lattice obtained by composing T ∘ S ∘ A(l). Here A(l) is an unweighted acceptor of l, which, when composed with a lattice, constrains all paths through the lattice to produce l as the output sequence. The expectation-maximization (EM) algorithm is commonly used to maximize marginal likelihood; however, the size of the lattice would make the computation prohibitively slow. We combine online learning (Liang and Klein, 2009) and curriculum learning (Bengio et al., 2009) to achieve faster convergence, as described in §3.3.1.
Figure 3: Schematic of the emission WFST with limited delay (here, up to 2) with states labeled by their delay values. *_o and *_l represent an arbitrary original or Latin symbol respectively. Weights of the arcs are omitted for clarity; weights with the same input-output label pairs are tied.

Unsupervised learning
We use a version of the stepwise EM algorithm described by Liang and Klein (2009), reminiscent of stochastic gradient descent in the space of the sufficient statistics. Training data is split into mini-batches, and after processing each mini-batch we update the overall vector of the sufficient statistics µ and re-estimate the parameters based on the updated vector. The update is performed by interpolating between the current value of the overall vector and the vector of sufficient statistics s_k collected from the k-th mini-batch:
$$\mu^{(k+1)} \leftarrow (1 - \eta_k)\, \mu^{(k)} + \eta_k\, s_k$$
The stepsize η_k is gradually decreased, causing the model to make smaller changes to the parameters as the learning stabilizes. Following Liang and Klein (2009), we set it to η_k = (k + 2)^{−β}.
However, if the mini-batch contains long sequences, summing over all paths in the corresponding lattices could still take a long time. The character substitutions are not arbitrary: each original alphabet symbol is likely to be mapped to only a few Latin characters, which means that most of the paths through the lattice would have very low probabilities. We prune the improbable arcs in the emission WFST while training on batches of shorter sentences. Doing this eliminates up to 66% and up to 76% of the emission arcs for Arabic and Russian respectively.
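The interpolation step of stepwise EM can be sketched in a few lines. The sketch assumes that the sufficient statistics are stored as dictionaries keyed by edit operation, and the value of β used in the example is chosen only for illustration, not taken from the paper.

```python
def stepsize(k, beta):
    """Stepsize eta_k = (k + 2) ** (-beta) for the k-th mini-batch."""
    return (k + 2) ** (-beta)


def stepwise_update(mu, s_k, k, beta):
    """Interpolate the running statistics mu with the mini-batch statistics s_k."""
    eta = stepsize(k, beta)
    keys = set(mu) | set(s_k)
    return {key: (1 - eta) * mu.get(key, 0.0) + eta * s_k.get(key, 0.0) for key in keys}


# Illustrative values only: with beta = 1.0 and k = 0 the stepsize is 0.5.
mu = stepwise_update({"a": 2.0}, {"a": 4.0, "b": 2.0}, k=0, beta=1.0)
```

Because the stepsize shrinks as k grows, later mini-batches move the running statistics less than earlier ones.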
We discourage excessive use of insertions and deletions by keeping the corresponding probabilities low at the early stages of training: during the first several updates, we freeze the deletion probabilities at a small initial value and disable insertions completely to keep the model locally normalized. We also iteratively increase the language model order as learning progresses. Once most of the emission WFST arcs have been pruned, we can afford to compose it with a larger language model WFST without the size of the resulting lattice rendering the computation impractical. The two steps of the EM algorithm are performed as follows:
E-step At the E-step we compute the sufficient statistics for updating θ, which in our case would be the expected number of traversals of each of the emission WFST arcs. For ease of bookkeeping, we compute those expectations using finite-state methods in the expectation semiring (Eisner, 2002). Summing over all paths in the lattice is usually performed via shortest distance computation in the log semiring; in the expectation semiring, we augment the weight of each arc with a basis vector, where the only non-zero element corresponds to the index of the emission edit operation associated with the arc (i.e. the input-output label pair). This way the shortest distance algorithm yields not only the marginal likelihood but also the vector of the sufficient statistics for the input sequence.
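The expectation semiring operations can be illustrated on a toy lattice with two explicit paths. This is a minimal sketch working with plain probabilities rather than the negative log weights an FST library would use; the helper names and the toy arc probabilities are assumptions made for illustration.

```python
def times(a, b):
    """Extend a path: (p1, r1) (x) (p2, r2) = (p1 * p2, p1 * r2 + p2 * r1)."""
    (p1, r1), (p2, r2) = a, b
    return p1 * p2, [p1 * y + p2 * x for x, y in zip(r1, r2)]


def plus(a, b):
    """Merge alternative paths: (p1, r1) (+) (p2, r2) = (p1 + p2, r1 + r2)."""
    (p1, r1), (p2, r2) = a, b
    return p1 + p2, [x + y for x, y in zip(r1, r2)]


def arc(prob, index, size):
    """Arc weight (p, p * e_index): a probability paired with a scaled basis vector."""
    r = [0.0] * size
    r[index] = prob
    return prob, r


# Toy lattice with three edit operations (indices 0, 1, 2) and two paths.
path_a = times(arc(0.5, 0, 3), arc(0.2, 1, 3))  # uses operations 0 and 1 once each
path_b = times(arc(0.1, 2, 3), arc(0.1, 2, 3))  # uses operation 2 twice
total_prob, weighted_counts = plus(path_a, path_b)
expected_counts = [c / total_prob for c in weighted_counts]
```

The first component of the total weight is the marginal likelihood, and dividing the second component by it gives the expected number of traversals of each edit operation.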
To speed up the shortest distance computation, we shrink the lattice by limiting the delay of all paths through the emission WFST. The delay of a path is defined as the difference between the numbers of epsilon labels on the input and output sides of the path. Figure 3 shows the schema of the emission WFST where delay is limited. Substitutions are performed without a state change, and each deletion or insertion arc transitions to the next or previous state respectively. When the first (last) state is reached, further deletions (insertions) are no longer allowed.
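The delay constraint can be illustrated by checking a single edit path. The sketch below assumes that a path is given as a list of (c_o, c_l) pairs with `None` standing for epsilon; the sign convention (insertions increase the delay, deletions decrease it) is an assumption made here for illustration.

```python
def within_delay_limit(path, limit):
    """Check that the running delay of an edit path never exceeds the limit.

    path is a sequence of (c_o, c_l) pairs where None stands for epsilon:
    (None, c_l) is an insertion and (c_o, None) is a deletion.
    """
    delay = 0
    for c_o, c_l in path:
        if c_o is None:      # epsilon on the input side
            delay += 1
        elif c_l is None:    # epsilon on the output side
            delay -= 1
        if abs(delay) > limit:
            return False
    return True


three_deletions = [("а", None), ("б", None), ("в", None)]
balanced = [(None, "s"), ("ш", "h"), ("ь", None)]
```

Three consecutive deletions violate a limit of 2, whereas an insertion later offset by a deletion stays within it.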

M-step
The M-step then corresponds to simply re-estimating θ by appropriately normalizing the obtained expected counts.
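A minimal sketch of this normalization is given below. It assumes that the expected counts are stored in a dictionary keyed by (c_o, c_l) and that each original character has its own distribution; how insertions and the prior counts enter the normalization is not shown.

```python
from collections import defaultdict


def m_step(expected_counts):
    """Normalize expected edit counts into p(c_l | c_o) for each c_o."""
    totals = defaultdict(float)
    for (c_o, _), count in expected_counts.items():
        totals[c_o] += count
    return {
        (c_o, c_l): count / totals[c_o]
        for (c_o, c_l), count in expected_counts.items()
        if totals[c_o] > 0
    }


# Toy expected counts for illustration.
theta = m_step({("ш", "w"): 3.0, ("ш", "s"): 1.0, ("ч", "4"): 2.0})
```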

Supervised learning
We also compare the performance of our model with the same model trained in a supervised way, using the annotated portion of the data that contains parallel o and l sequences. In the supervised case we can additionally constrain the lattice with an acceptor of the original orthography sequence:
$$A(o) \circ T \circ S \circ A(l)$$
However, the alignment between the symbols in o and l is still latent. To optimize this marginal likelihood we still employ the EM algorithm. As this constrained lattice is much smaller, we can run the standard EM without the modifications discussed in §3.3.1.

Decoding
Inference at test time is also performed using finite-state methods and closely resembles the E-step of the unsupervised learning: given a Latin sequence l, we construct the machine T ∘ S ∘ A(l) in the tropical semiring and run the shortest path algorithm to obtain the most probable path ê; the source sequence ô is read off the obtained path.
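In probabilistic terms, the shortest path in the tropical semiring corresponds to the single most probable alignment path rather than to a sum over alignments. One way to write this, using the notation o(e) (introduced here for illustration) for the original-script sequence read off the input side of a path e:

```latex
\hat{e} = \operatorname*{arg\,max}_{e \,:\, \mathrm{output}(e) = l} \; p\big(o(e); \gamma\big) \cdot p(e; \theta),
\qquad
\hat{o} = o(\hat{e})
```

Since arc weights are negative log probabilities, maximizing this product is equivalent to minimizing the sum of the weights along the path.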

Datasets
Here we discuss the data used to train the unsupervised model. Unlike Arabizi, which has been explored in prior work due to its popularity in the modern online community, a dataset of informally romanized Russian was not available, so we collect and partially annotate our own dataset from the Russian social network vk.com.

Arabic
We use the Arabizi portion of the LDC BOLT Phase 2 SMS/Chat dataset (Bies et al., 2014), a collection of written informal conversations in romanized Egyptian Arabic annotated with their Arabic script representation. To prevent the annotators from introducing orthographic variation inherent to dialectal Arabic, compliance with the Conventional Orthography for Dialectal Arabic (CODA; Habash et al., 2012) is ensured. However, the effects of some of the normalization choices (e.g. expanding frequent abbreviations) would pose difficulties to our model. To obtain a subset of the data better suited for our task, we discard any instances which are not originally romanized (5% of all data), ones where the Arabic annotation contains Latin characters (4%), or where emoji/emoticon normalization was performed (12%). The information about the splits is provided in Table 2. Most of the data is allocated to the language model training set in order to give the unsupervised model enough signal from the native script side. We choose to train the transition model on the annotations from the same corpus to make the language model specific to both the informal domain and the CODA orthography.

Russian
We collect our own dataset of romanized Russian text from a social network website vk.com, adopting an approach similar to the one described by Darwish (2014). We take a list of the 50 most frequent Russian lemmas (Lyashevskaya and Sharov, 2009), filtering out those shorter than 3 characters, and produce a set of candidate romanizations for each of them to use as queries to the vk.com API. In order to encourage diversity of romanization styles in our dataset, we generate the queries by defining all plausible visual and phonetic mappings for each Cyrillic character and applying all possible combinations of those substitutions to the underlying Russian word. We scrape public posts on the user and group pages, retaining only the information about which posts were authored by the same user, and manually go over the collected set to filter out coincidental results.
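The query generation step can be sketched as a Cartesian product over per-character substitution options. The mapping table below is a hypothetical toy example and not the set of mappings used to build the dataset.

```python
from itertools import product


def candidate_romanizations(word, mappings):
    """Enumerate every combination of per-character Latin substitutions."""
    options = [mappings[ch] for ch in word]
    return {"".join(combo) for combo in product(*options)}


# Hypothetical mappings for illustration; the paper's full tables are not reproduced.
mappings = {"ч": ["ch", "4"], "т": ["t", "m"], "о": ["o", "0"]}
queries = candidate_romanizations("что", mappings)
```

With two options per character, the three-letter word yields eight candidate queries, mixing phonetic and visual substitutions within the same word.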
Our dataset consists of 1796 wall posts from 1681 users and communities. Since the posts are quite long on average (248 characters, longest ones up to 15K), we split them into sentences using the NLTK sentence tokenizer, with manual correction when needed. The obtained sentences are used as data points, split into training, validation and test according to the numbers in Table 2. The average length of an obtained sentence is 65 characters, which is 3 times longer than an average Arabizi sentence; we believe this is due to the different nature of the data (social media posts vs. SMS). Sentences collected from the same user are distributed across different splits so that we observe a diverse set of romanization preferences in both training and testing. Each sentence in the validation and test sets is annotated by one of the two native speaker annotators, following guidelines similar to those designed for the Arabizi BOLT data (Bies et al., 2014). For more details on the annotation guidelines and inter-annotator agreement, see Appendix A.
Since we do not have enough annotations to train the Russian language model on the same corpus, we use a separate in-domain dataset. We take a portion of the Taiga dataset (Shavrina and Shapovalova, 2017), containing 307K comments scraped from the same social network vk.com, and apply the same preprocessing steps as we did in the collection process.

Experiments
Here we discuss the experimental setup used to determine how much information relevant for our task is contained in the character similarity mappings, and how it compares to the amount of information encoded in the human annotations. We compare them by evaluating the effect of the informative priors (described in §3.2) on the performance of the unsupervised model and comparing it to the performance of the supervised model.

Methods
We compare the performance of our model trained in three different setups: unsupervised with a uniform prior on the emission parameters, unsupervised with informative phonetic and visual priors ( §3.2), and supervised. We additionally compare them to a commercial online decoding system for each language (directly encoding human knowledge about the transliteration process) and a character-level unsupervised neural machine translation architecture (encoding no assumptions about the underlying process at all).
We train the unsupervised models with the stepwise EM algorithm as described in §3.3.1, performing stochastic updates and making only one pass over the entire training set. The supervised models are trained on the validation set with five iterations of EM with a six-gram transition model. It should be noted that only a subset of the validation data is actually used in the supervised training: if the absolute value of the delay of the emission WFST paths is limited by n, we will not be able to compose a lattice for any data points where the input and output sequences differ in length by more than n (those constitute 22% of the Arabic validation data and 33% of the Russian validation data for n = 5 and n = 2 respectively). Since all of the Arabic data comes annotated, we can perform the same experiment using the full training set; surprisingly, the performance of the supervised model does not improve (see Table 3).
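The length-based restriction on supervised training pairs can be sketched as a simple filter. The function names and the toy pairs are hypothetical, and n stands for the delay limit.

```python
def composable(o, l, n):
    """A pair can be aligned only if its length difference is within the delay limit n."""
    return abs(len(o) - len(l)) <= n


def filter_pairs(pairs, n):
    """Keep the (o, l) pairs for which a delay-limited lattice can be composed."""
    return [(o, l) for o, l in pairs if composable(o, l, n)]


# Toy pairs for illustration: the second one differs in length by 4 characters.
pairs = [("хорошо", "horosho"), ("что", "chto ;)")]
kept = filter_pairs(pairs, n=2)
```

The final delay of any path equals the difference in length between the Latin and original sequences, so pairs exceeding the limit have no valid path.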
The online transliteration decoding systems we use are translit.net for Russian and Yamli for Arabic. The Russian decoder is rule-based, but the information about what algorithm the Arabic decoder uses is not disclosed.
We take the unsupervised neural machine translation (UNMT) model of Lample et al. (2018) as the neural baseline, using the implementation from the codebase of He et al. (2020), with one important difference: since the romanization process is known to be strictly character-level, we tokenize the text into characters rather than words.
Implementation We use the OpenFst library (Allauzen et al., 2007) for the implementation of all the finite-state methods, in conjunction with the OpenGrm NGram library (Roark et al., 2012) for training the transition model specifically. We train the character-level n-gram models with Witten-Bell smoothing (Witten and Bell, 1991) of orders from two to six. Since the WFSTs encoding full higher-order models become very large (for example, the Russian six-gram model has 3M states and 13M arcs), we shrink all the models except for the bigram one using relative entropy pruning (Stolcke, 1998). However, since pruning decreases the quality of the language model, we observe most of the improvement in accuracy while training with the unpruned bigram model, and the subsequent order increases lead to relatively minor gains. Hyperparameter settings for training the transition and emission WFSTs are described in Appendix B.
We optimize the delay limit for each language separately, obtaining best results with 2 for Russian and 5 for Arabic. To approximate the monotonic word-level alignment between the original and Latin sequences, we restrict the operations on the space character to only three: insertion, deletion, and substitution with itself. We apply the same to the punctuation marks (with specialized substitutions for certain Arabic symbols, such as ؟ → ?). This substantially reduces the number of arcs in the emission WFST, as punctuation marks make up over half of each of the alphabets.
Table 3: Character error rate for different experimental setups. We compare unsupervised models with and without informative priors with the supervised model (trained on validation data) and a commercial online system. We do not have a visual prior for Arabic due to the Arabic-Latin visual character similarity not being captured by the restrictive confusables list that defines the prior (see §3.2). Each supervised and unsupervised experiment is performed with 5 random restarts. *The Arabic supervised experiment result is for the model trained on the validation set; training on the 5K training set yields 0.226.
Evaluation We use character error rate (CER) as our evaluation metric. We compute CER as the ratio of the character-level edit distance between the predicted original script sequence and the human annotation to the length of the annotation sequence in characters.
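A minimal sketch of this metric is shown below, assuming unit costs for insertions, deletions, and substitutions and a non-empty reference; it is not the evaluation script used in the paper.

```python
def edit_distance(pred, ref):
    """Character-level Levenshtein distance with unit costs."""
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # extra character in pred
                            curr[j - 1] + 1,          # character missing from pred
                            prev[j - 1] + (p != r)))  # match or substitution
        prev = curr
    return prev[-1]


def cer(pred, ref):
    """Edit distance divided by the reference length (ref must be non-empty)."""
    return edit_distance(pred, ref) / len(ref)
```

Because the denominator is the reference length, CER can exceed 1 when the prediction is much longer than the annotation.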

Results and analysis
The CER values for the models we compare are presented in Table 3. One trend we notice is that the error rate is lower for Russian than for Arabic in all the experiments, including the uniform prior setting, which suggests that decoding Arabizi is an inherently harder task. Some of the errors of the Arabic commercial system could be explained by the decoder predictions being plausible but not matching the CODA orthography of the reference.
Table 4: Emission probabilities learned by the supervised model (compare to Table 1). All substitutions with probability greater than 0.01 are shown.

Effect of priors
The unsupervised models without an informative prior perform poorly for either language, which means that there is not enough signal in the language model alone under the training constraints we enforce. Possibly, the algorithm could have converged to a better local optimum if we did not use the online algorithm and prune both the language model and the emission model; however, that experiment would be infeasibly slow. Incorporating a phonetic prior reduces the error rate by 0.36 and 0.44 for Arabic and Russian respectively, which provides a substantial improvement while maintaining the efficiency advantage. The visual prior for Russian appears to be slightly less helpful, improving CER by 0.29. We attribute the better performance of the model with the phonetic prior to the sparsity and restrictiveness of the visually confusable symbol mappings, or it could be due to the phonetic substitutions being more popular with users. Finally, combining the two priors for Russian leads to a slight additional improvement in accuracy over the phonetic prior only. We additionally verify that the phonetic and visual similarity-based substitutions are prominent in informal romanization by inspecting the emission parameters learned by the supervised model with a uniform prior (Table 4). We observe that: (a) the highest-probability substitutions can be explained by either phonetic or visual similarity, and (b) the external mappings we use for our priors are indeed appropriate since the supervised model recovers the same mappings in the annotated data. Figure 4 shows some of the elements of the confusion matrices for the test predictions of the best-performing unsupervised models in both languages. 
We see that many of the frequent errors are caused by the model failing to disambiguate between two plausible decodings of a Latin character, either mapped to it through different types of similarity (н /n/ [phonetic] → n ← [visual] п, н [visual] → h ← [phonetic] х /x/), or through the same one (visual 8 → 8 ← в, phonetic ه /h/ → h ← ح /ħ/); such cases could be ambiguous for humans to decode as well.

Error analysis
Other errors in Figure 4 illustrate the limitations of our parameterization and the resources we rely on. Our model does not allow one-to-many alignments, which leads to digraph interpretation errors such as /s/ + /h/ → sh ← /S/. Some artifacts of the resources our priors are based on also pollute the results: for example, the confusion between ь and h in Russian is explained by the Russian soft sign ь, which has no English phonetic equivalent, being arbitrarily mapped to the Latin x in one of the phonetic keyboard layouts.
Comparison to UNMT The unsupervised neural model trained on Russian performs only marginally worse than the unsupervised WFST model with an informative prior, demonstrating that with a sufficient amount of data the neural architecture is powerful enough to learn the character substitution rules without the need for inductive bias. However, we cannot say the same about Arabic: with a smaller training set (see Table 2), the UNMT model is outperformed by the unsupervised WFST even without an informative prior. The main difference in performance between the two models comes down to the trade-off between structure and power: although the neural architecture captures long-range dependencies better thanks to its stronger language model, it does not provide an easy way of enforcing character-level constraints on the decoding process, which the WFST model encodes by design. As a result, we observe that while the UNMT model recovers whole words more successfully (for Russian it achieves a BLEU score of 45.8, while the best-performing unsupervised WFST is at 20.4), it also tends to arbitrarily insert or repeat words in the output, which leads to higher CER.

Conclusion
This paper tackles the problem of decoding nonstandardized informal romanization used in social media into the original orthography without parallel text. We train a WFST noisy-channel model to decode romanized Egyptian Arabic and Russian to their original scripts with the stepwise EM algorithm combined with curriculum learning and demonstrate that while the unsupervised model by itself performs poorly, introducing an informative prior that encodes the notion of phonetic or visual character similarity brings its performance substantially closer to that of the supervised model. The informative priors used in our experiments are constructed using sets of character mappings compiled for other purposes but using the same underlying principle (phonetic keyboard layouts and the Unicode confusable symbol list). While these mappings provide a convenient way to avoid formalizing the complex notions of phonetic and visual similarity, they are restrictive and do not capture all the diverse aspects of similarity that idiosyncratic romanization uses, so designing more suitable priors via operationalizing the concept of character similarity could be a promising direction for future work. Another research avenue that could be explored is modeling specific user preferences: since each user likely favors a certain set of character substitutions, allowing user-specific parameters could improve decoding and be useful for authorship attribution.
Figure 4: Fragments of the confusion matrix comparing test time predictions of the best-performing unsupervised models for Arabic (left) and Russian (right) to human annotations. Each number represents the count of the corresponding substitution in the best alignment (edit distance path) between the predicted and gold sequences, summed over the test set. Rows stand for predictions, columns correspond to ground truth.

A Data collection and annotation
Preprocessing We generate a set of 270 candidate transliterations of 26 Russian words to use as queries. However, many of the produced combinations are highly unlikely and yield no results, and some happen to share the spelling with words in other languages (most often other Slavic languages that use Latin script, such as Polish). We scrape public posts on user and group pages, retaining only the information about which posts were authored by the same user, and manually go over the collected set to filter out coincidental results. We additionally preprocess the collected data by normalizing punctuation and removing non-ASCII characters and emoji. We also shorten every run of the same character repeated more than twice to exactly two repetitions, as suggested by Darwish et al. (2012), since these repetitions are more likely to be a written expression of emotion than to be explained by the underlying Russian sentence. The same preprocessing steps are applied to the original script side of the data (the annotations and the monolingual language model training corpus) as well.
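The character-level cleanup described above can be sketched as follows. This is a minimal illustration rather than the released preprocessing code: the function names are hypothetical, punctuation normalization is omitted because its rules are not specified here, and we assume that ASCII filtering is applied only to the romanized side, since it would erase Cyrillic text.

```python
import re

# Matches any character followed by two or more copies of itself.
_REPEATS = re.compile(r"(.)\1{2,}")


def collapse_repeats(text: str) -> str:
    """Shorten every run of 3+ identical characters to exactly two."""
    return _REPEATS.sub(r"\1\1", text)


def strip_non_ascii(text: str) -> str:
    """Drop non-ASCII characters, including emoji (assumed: romanized side only)."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def preprocess_romanized(text: str) -> str:
    """Hypothetical pipeline for a romanized post: ASCII filter, then collapse runs."""
    return collapse_repeats(strip_non_ascii(text))
```

Because `collapse_repeats` is script-agnostic, the same function can be reused on the Cyrillic annotations and the language model training corpus.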
Annotation guidelines While transliterating, annotators perform orthographic normalization wherever possible, correcting typos and errors in word boundaries; grammatical errors are not corrected. Tokens that do not require transliteration (foreign words, emoticons) or ones that the annotator fails to identify (proper names, badly misspelled words) are removed from the romanized sentence and not transliterated. Although this means that some of the test set sentences will not exactly represent the original romanized sequence, it helps ensure that we are only testing our model's ability to transliterate rather than make word-by-word normalization decisions.
In addition, 200 of the validation sequences are dually annotated to measure the inter-annotator agreement. We evaluate it using character error rate (CER; edit distance between the two sequences normalized by the length of the reference sequence), the same metric we use to evaluate the model's performance. In this case, since neither of the annotations is the ground truth, we compute CER in both directions and average. Despite the discrepancies caused by the annotators deleting unknown words at their discretion, average CER is only 0.014, which indicates a very high level of agreement.
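The agreement measure described above can be sketched as below: CER is the Levenshtein distance normalized by the reference length, computed in both directions and averaged. The function names are ours, and we assume per-pair computation; whether the reported figure aggregates per pair or over the whole set is not stated here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)


def symmetric_cer(annotation_a: str, annotation_b: str) -> float:
    """Average of CER in both directions, for when neither side is gold."""
    return 0.5 * (cer(annotation_a, annotation_b) + cer(annotation_b, annotation_a))
```

The two directions differ only in the normalizing length, so averaging matters mainly when one annotator deleted tokens that the other kept.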
B Hyperparameter settings
WFST model The Witten-Bell smoothing parameter for the language model is set to 10, and the relative entropy pruning threshold is $10^{-5}$ for the trigram model and $2 \cdot 10^{-5}$ for higher-order models. Unsupervised training is performed in batches of size 10 and the language model order is increased every 100 batches. While training with the bigram model, we disallow insertions and freeze all the deletion probabilities at $e^{-100}$. The EM stepsize decay rate is $\beta = 0.9$. The emission arc pruning threshold is gradually decreased from 5 to 4.5 (in the negative log probability space). We perform multiple random restarts for each experiment, initializing the emission distribution to uniform plus random noise.
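To make the roles of these settings concrete, the sketch below shows how a decay rate and an order-increase interval could enter a stepwise EM loop. It is an illustration under stated assumptions, not the released implementation: the stepsize form (k + 2)^(-β) is a commonly used stepwise EM schedule that we assume here, the starting order of 2 is inferred from the mention of the bigram stage, and `max_order` is a hypothetical parameter because the final order is not given in this appendix.

```python
def stepsize(k: int, beta: float = 0.9) -> float:
    """Assumed stepwise EM schedule: eta_k = (k + 2) ** (-beta)."""
    return (k + 2) ** (-beta)


def stepwise_update(running: dict, batch_stats: dict, k: int, beta: float = 0.9) -> dict:
    """Interpolate running sufficient statistics with the current batch's statistics."""
    eta = stepsize(k, beta)
    keys = set(running) | set(batch_stats)
    return {
        key: (1.0 - eta) * running.get(key, 0.0) + eta * batch_stats.get(key, 0.0)
        for key in keys
    }


def lm_order(batch_index: int, max_order: int, start_order: int = 2,
             batches_per_order: int = 100) -> int:
    """Curriculum: raise the language model order every `batches_per_order` batches.

    `max_order` is hypothetical; the source only states the 100-batch interval.
    """
    return min(start_order + batch_index // batches_per_order, max_order)
```

With β = 0.9 the stepsize shrinks quickly, so later batches adjust the accumulated statistics only slightly, which is one reason the initialization and the early low-order stages matter.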
UNMT baseline Our unsupervised neural baseline uses a single-layer LSTM with hidden state size 512 for both the encoder and the decoder. The embedding dimension is set to 128. For the denoising autoencoding loss, we adopt the default noise model and hyperparameters as described by Lample et al. (2018). The autoencoding loss is annealed over the first 3 epochs.
We tune the maximum training sequence length (controlling how much training data is used) and the maximum allowed decoding length by optimizing the validation set CER. In our case, the maximum output length is important because the evaluation metric penalizes the discrepancy in length between the prediction and the reference; we observe the best results when setting it to 40 characters for Arabic and 180 for Russian. At training time, we filter out sequences longer than 100 characters for either language, which constitute 1% of the available Arabic training data (the Arabic-only LM training set and the Latin-only training set combined) but almost 70% of the Russian data. Surprisingly, the Russian model trained on the remaining 30% achieves better results than the one trained on the full data; we hypothesize that the improvement comes from having a more balanced training set, since the full data is otherwise heavily skewed towards the Cyrillic side (the LM training set; see Table 2).
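The training-time length filter amounts to the following small sketch. The function name is hypothetical; the 100-character cutoff is the value reported above, and the returned fraction corresponds to the share of data that the filter removes.

```python
def filter_by_length(sequences: list, max_chars: int = 100):
    """Keep sequences of at most `max_chars` characters.

    Returns the kept sequences and the fraction of the input that was removed.
    """
    kept = [s for s in sequences if len(s) <= max_chars]
    removed_fraction = 1.0 - len(kept) / len(sequences) if sequences else 0.0
    return kept, removed_fraction
```

Applying the filter separately to the Latin-only and original-script sets makes it easy to check how the balance between the two sides changes with the cutoff.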