Deciphering Related Languages

We present a method for translating texts between close language pairs. The method does not require parallel data, nor does it require the languages to be written in the same script. We show results for six language pairs: Afrikaans/Dutch, Bosnian/Serbian, Danish/Swedish, Macedonian/Bulgarian, Malaysian/Indonesian, and Polish/Belorussian. We report BLEU scores showing that our method outperforms other methods that do not use parallel data.


Introduction
Statistical Natural Language Processing (NLP) tools often need large amounts of training data in order to achieve good performance. This limits the use of current NLP tools to a few resource-rich languages. Assume an incident happens in an area with a low-resource language, known as the Incident Language (IL). For a quick response, we need to build NLP tools with available data, as finding or annotating new data is expensive and time-consuming. For many languages this means that we only have a small amount of often out-of-domain parallel data (e.g. a Bible or Ubuntu manual), some monolingual data, and almost no annotation such as part-of-speech tags.
Fortunately, many low-resource languages have one or more higher-resource, closely Related Languages (RL). Examples of such IL/RL pairs are Afrikaans/Dutch and Bosnian/Serbian. A natural idea is to use RL resources to improve the task for IL. But this requires some kind of conversion between RL and IL. Assume the required NLP capability is named entity tagging. If we can convert RL to IL, we can convert all RL training data along with annotations into IL and train the tagger for IL. Or, if we can convert IL to RL we can use the potentially existing RL named entity tagger on converted IL data and project back the tags.
Following this idea, Currey et al. (2016) use a rule-based translation system to convert Italian and Portuguese into Spanish to improve Spanish (here, IL) language modeling. Nakov and Ng (2009) convert RL/English parallel data to IL/English, where both RL and IL have Latin orthography, to improve IL/English machine translation. Hana et al. (2006) use cognates to adapt Spanish resources to Brazilian Portuguese to train a part-of-speech tagger. Mann and Yarowsky (2001) use Spanish/Portuguese cognates to convert an English/Spanish lexicon to English/Portuguese. These works demonstrate the usefulness of RL data for improving NLP for IL, but they are designed for specific tasks and IL/RL pairs.
In this paper we propose a universal method for translating texts between closely related languages. We assume that IL and RL are mostly cognates, having roughly the same word order. Our method is orthography-agnostic for alphabetic systems, and crucially, it does not need any parallel data. From now on, we talk about converting RL to IL, but the method does not distinguish between RL and IL; as mentioned above, each direction of translation can have its own potential uses.
To translate RL to IL, we train a character-based cipher model and connect it to a word-based language model. The cipher model is trained in a noisy-channel setting where a character language model produces IL characters and the cipher model converts them to RL. Expectation-Maximization (EM) is used to train the model parameters to maximize the likelihood of a set of RL monolingual data. At decoding time, the cipher model reads the RL text character by character, with words separated by a special boundary character, and produces a weighted lattice of characters representing all the possible translations of each input token.

(Figure 1: The process used for training the cipher model and decoding RL text to IL.)

The word-based language model takes this lattice and produces the sequence of output words that maximizes the language model score times the cipher model score. Figure 1 depicts this process. Our cipher models one-to-one, one-to-two, and two-to-one character mappings. This allows us to handle cases like Cyrillic 'ч' and Latin 'ch', as well as subtle differences in pronunciation between RL and IL, like Portuguese 'justiça' and Spanish 'justicia'. Using a character-based cipher model provides the flexibility to generate unseen words; in other words, the vocabulary is limited by the decoding LM, not the cipher model. Separating the training and decoding language models enables us to train the decoding LM on as much data as is available, without worrying about training speed or memory issues. We can also transliterate out-of-vocabulary words by spelling out the best path produced by the cipher model in case no good match is found for a token in the decoding LM.
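The per-token candidate lattice can be sketched as a weighted enumeration over per-character options. This is only a toy illustration in Python (the paper builds weighted character lattices with WFSTs); the candidate map and all weights below are hypothetical:

```python
from itertools import product

# Hypothetical decoding-direction candidate map: each RL character maps to
# weighted IL strings. A correspondence such as Cyrillic 'ч' vs. Latin 'ch'
# simply yields a two-character candidate.
candidates = {
    "ч": [("ch", 0.7), ("c", 0.3)],
    "a": [("a", 0.9), ("e", 0.1)],
    "s": [("s", 1.0)],
}

def token_lattice(token, top_k=5):
    """Enumerate weighted IL candidate strings for one RL token."""
    options = [candidates[c] for c in token]
    paths = []
    for combo in product(*options):
        string = "".join(piece for piece, _ in combo)
        weight = 1.0
        for _, w in combo:
            weight *= w
        paths.append((string, weight))
    paths.sort(key=lambda x: -x[1])
    return paths[:top_k]

print(token_lattice("чas")[0])  # best candidate is "chas", weight ≈ 0.63
```

In the real system this enumeration is implicit in the WFST; the decoding LM then rescores the candidates rather than taking the cipher-best path directly.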

Related Work
Previous work on translation between related languages can be categorized into three groups. The first group consists of systems built for specific language pairs, such as Czech-Slovak (Hajič et al., 2000), Turkish-Crimean Tatar (Cicekli, 2002), Irish-Scottish Gaelic (Scannell, 2006), and Indonesian-Malaysian (Larasati and Kubo, 2010). A similar line of work is translation between dialects of the same language, such as Arabic dialects to standard Arabic (Hitham et al., 2008; Sawaf, 2010; Salloum and Habash, 2010). Work has also been done on translating Romanized versions of languages back to their original script, such as Greeklish to Greek (Chalamandaris et al., 2006) and Arabizi to Arabic (May et al., 2014). These methods cannot be applied to our problem because the time and resources required to build a translation system for each specific language pair are not available.
Machine learning systems that use parallel data: these methods cover a broader range of languages but require parallel text between the related languages. They include character-level machine translation (Vilar et al., 2007; Tiedemann, 2009) or a combination of word-level and character-level machine translation (Nakov and Tiedemann, 2012) between related languages.
Use of non-parallel data: cognates can be extracted from monolingual data and used as a parallel lexicon (Hana et al., 2006; Mann and Yarowsky, 2001; Kondrak et al., 2003). However, our task is whole-text transformation, not just cognate extraction.
Unsupervised deciphering methods, which require no parallel data, have been used for bilingual lexicon extraction and machine translation. Word-based deciphering systems ignore sub-word similarities between related languages (Koehn and Knight, 2002; Ravi and Knight, 2011b; Nuhn et al., 2012; Dou and Knight, 2012; Ravi, 2013). Haghighi et al. (2008) and Naim and Gildea (2015) propose models that can use orthographic similarities. However, the model proposed by Naim and Gildea (2015) can only produce a parallel lexicon, not a translation. Furthermore, both systems require the languages to share an orthography, and their vocabulary is limited to what they see during training.
Character-based decipherment is the model we use to solve this problem. Character-based decipherment has previously been applied to problems such as solving letter substitution ciphers (Knight et al., 2006; Ravi and Knight, 2011a) and transliterating Japanese katakana into English (Ravi and Knight, 2009), but not to translating full texts between related languages.

Translating RL to IL
We learn a character-based cipher model for translating RL to IL. At decoding time, this model is combined with a word-based IL language model to produce IL text from RL text.

Cipher Model
Our noisy-channel cipher model converts a sequence of IL characters s_1, ..., s_n to a sequence of RL characters t_1, ..., t_m. It is a WFST composed of three components (Figure 2): WFST1 is a one-to-one letter substitution model. For each IL character s it writes one RL character t with probability p_1(t|s).
Figure 2: Part of the cipher model corresponding to reading IL character s from the start state. The same pattern repeats for every IL character. After reading s, the model goes to WFST1, WFST2, or WFST3 with respective probability α(s), β(s), or γ(s). In WFST1, the model produces each RL character t with probability p_1(t|s). In WFST2, the model produces each pair of RL characters t and t′ with probabilities p_21(t|s) and p_22(t′|s). In WFST3, the model reads a second IL character s′ and produces each RL character t with probability p_3(t|ss′). From the last state of WFST1, WFST2, and WFST3, the model returns to the start state without reading or writing. The model has a loop on the start state that reads and writes space.
WFST2 is a one-to-two letter substitution model. For each IL character s, it writes two RL characters t and t′ with respective probabilities p_21(t|s) and p_22(t′|s).
We assume p_22(t′|s) is independent of t. As a result we can estimate p(tt′|s) = p(t|s) p(t′|t, s) ≈ p_21(t|s) p_22(t′|s), as modeled in WFST2. This simplification is required to keep the model practical. Otherwise, the size of the cipher model would become cubic in the number of RL and IL characters, and combining it with a language model would make the system infeasibly large for training.
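To see why the factorization matters, compare parameter counts under illustrative alphabet sizes (the 30-character figure is our assumption for illustration, not a number from the paper):

```python
# Illustrative alphabet sizes (assumed, not from the paper).
n_il, n_rl = 30, 30

# Unfactored one-to-two model: a full p(t t' | s) table needs |RL|^2
# entries per IL character.
full = n_il * n_rl * n_rl

# Factored model: p_21(t|s) and p_22(t'|s) each need |RL| entries per
# IL character.
factored = 2 * n_il * n_rl

print(full, factored)  # 27000 vs. 1800 parameters
```

The gap grows quadratically with the RL alphabet size, and the unfactored table would be multiplied again when composed with the language model.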
WFST3 is a two-to-one letter substitution model. For each IL character s, it reads another IL character s′ with probability 1, and then writes one RL character t with probability p_3(t|ss′). As we will discuss in Section 3.2, we train p_3 directly from p_21 and p_22, so the cubic number of parameters does not cause a problem.
The start state reads each IL character s and goes to WFST1, WFST2, or WFST3 with respective probability α(s), β(s), or γ(s). The last state of each component returns to start without reading or writing anything. The start state also reads and writes space with probability one.
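The start-state choice and the component probabilities can be pictured as simple lookup tables, with the probability of a single derivation being the product along its path. A minimal sketch, with all values hypothetical:

```python
# Toy slice of the cipher model's parameters (all values hypothetical).
alpha = {"c": 0.2}   # probability of entering WFST1 after reading IL 'c'
beta  = {"c": 0.5}   # probability of entering WFST2
gamma = {"c": 0.3}   # probability of entering WFST3
p3 = {(("c", "h"), "ч"): 1.0}  # two-to-one: IL "ch" -> RL "ч"

# One derivation: read IL 'c', choose WFST3 (gamma), read the second IL
# character 'h' (probability 1), and write RL 'ч'.
prob = gamma["c"] * p3[(("c", "h"), "ч")]
print(prob)  # 0.3
```

A full derivation for a word multiplies one such factor per component visit, plus the probability-one space loop at word boundaries.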

Training the Model
The cipher model described in Section 3.1 is much more flexible than a one-to-one letter substitution cipher. A few thousand sentences of RL monolingual data are not enough to train the model as a whole, and more training data makes the process too slow to be practical. Hence, we break the full model into WFST1, WFST2, and WFST3 and train the parameters of each component, i.e. p_1, p_21 and p_22, and p_3, in separate steps. A final step trains the probability of moving into each of the components, i.e. α, β, and γ. Each step of the training uses the EM algorithm to maximize the likelihood of 500 sentences of RL text in a noisy-channel model, where a fixed 5-gram character-based IL language model (trained on 5000 IL sentences) produces IL text character by character and the cipher model converts the IL characters into RL (top section of Figure 1).
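As a heavily simplified analogue of one such EM step, the sketch below re-estimates the emission table of a one-to-one substitution channel by forward-backward under a fixed bigram "IL" character model. The paper trains WFSTs against a 5-gram character LM; here everything (alphabets, LM values, the observed text) is invented toy data:

```python
def em_cipher(rl_seq, il_chars, lm_init, lm_bigram, iters=30):
    """EM for p(t|s): hidden IL character s emits observed RL character t.
    The IL character LM (initial + bigram probs) stays fixed; only the
    emission table is re-estimated, as in Baum-Welch with fixed transitions."""
    rl_chars = sorted(set(rl_seq))
    p = {s: {t: 1.0 / len(rl_chars) for t in rl_chars} for s in il_chars}
    n = len(rl_seq)
    for _ in range(iters):
        # E-step: forward-backward over the hidden IL sequence.
        fwd = [dict() for _ in range(n)]
        bwd = [dict() for _ in range(n)]
        for s in il_chars:
            fwd[0][s] = lm_init[s] * p[s][rl_seq[0]]
        for i in range(1, n):
            for s in il_chars:
                fwd[i][s] = p[s][rl_seq[i]] * sum(
                    fwd[i - 1][r] * lm_bigram[r][s] for r in il_chars)
        for s in il_chars:
            bwd[n - 1][s] = 1.0
        for i in range(n - 2, -1, -1):
            for s in il_chars:
                bwd[i][s] = sum(lm_bigram[s][r] * p[r][rl_seq[i + 1]] * bwd[i + 1][r]
                                for r in il_chars)
        z = sum(fwd[n - 1][s] for s in il_chars)
        counts = {s: {t: 1e-9 for t in rl_chars} for s in il_chars}
        for i in range(n):
            for s in il_chars:
                counts[s][rl_seq[i]] += fwd[i][s] * bwd[i][s] / z
        # M-step: renormalize expected counts into probabilities.
        for s in il_chars:
            tot = sum(counts[s].values())
            p[s] = {t: c / tot for t, c in counts[s].items()}
    return p

# Toy IL language alternates 'a' and 'b'; the RL text alternates 'x' and 'y',
# so EM should discover that 'a' emits 'x' and 'b' emits 'y'.
lm_init = {"a": 0.7, "b": 0.3}
lm_bigram = {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.9, "b": 0.1}}
p = em_cipher("xy" * 20, ["a", "b"], lm_init, lm_bigram)
```

Because the channel is symmetric under relabeling, the fixed LM (here, the asymmetric initial distribution) is what breaks the tie; the same is true at full scale, where the 5-gram IL LM anchors the decipherment.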
Step one: We set α(s) = 1 and β(s) = γ(s) = 0, and train p_1,IL→RL(t|s) for each IL character s and each RL character t. In parallel, we reverse RL and IL and train p_1,RL→IL(s|t) for each RL character t and each IL character s. We use p_1(t|s) = ½ (p_1,IL→RL(t|s) + p_1,RL→IL(s|t)) to set the WFST1 parameters in the next steps.
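Step one's symmetrization can be sketched directly; the probability tables below are hypothetical toy values, not trained numbers:

```python
def symmetrize_p1(p_il2rl, p_rl2il):
    """p_1(t|s) = 0.5 * (p_1,IL->RL(t|s) + p_1,RL->IL(s|t)).
    Keys are (input char, output char) in each table's own direction."""
    keys = set(p_il2rl) | {(s, t) for (t, s) in p_rl2il}
    return {(s, t): 0.5 * (p_il2rl.get((s, t), 0.0) + p_rl2il.get((t, s), 0.0))
            for (s, t) in keys}

# Toy example: Afrikaans 'v' vs. Dutch 'w' (values hypothetical).
p_il2rl = {("v", "w"): 0.8, ("v", "v"): 0.2}
p_rl2il = {("w", "v"): 0.6, ("w", "u"): 0.4}
p1 = symmetrize_p1(p_il2rl, p_rl2il)
print(p1[("v", "w")])  # 0.5 * (0.8 + 0.6) ≈ 0.7
```

Averaging the two directions smooths out noise that each direction's EM run picks up on its own small training set.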
Step two: We set α(s) = β(s) = ½ and γ(s) = 0, fix p_1, and train p_21,IL→RL(t|s) and p_22,IL→RL(t′|s) for each IL character s and each pair of RL characters t and t′. In parallel, we reverse RL and IL and train p_21,RL→IL(s|t) and p_22,RL→IL(s′|t) for each RL character t and each pair of IL characters s and s′.
Step three: Our cipher model has to decide, after reading one IL character, whether it will perform a one-to-one, one-to-two, or two-to-one mapping. In the first two scenarios the model has enough information to decide, but for the two-to-one mapping the model has to decide before reading the second IL character. For instance, consider converting Bosnian to Serbian. When the model reads the character 'c' it has to decide between the one-to-one, one-to-two, and two-to-one mappings. A good decision is often the two-to-one mapping, because 'ch' maps to 'ч'; hence the system learns a large γ for the character 'c'. But the same γ applies regardless of which character follows 'c', which is not desirable.
One way to overcome this problem is to change the model to make the decision after reading two IL characters, but this would over-complicate the model. We use a simpler trick instead. We compute p_3,IL→RL(t|ss′) from p_21,RL→IL(s|t) and p_22,RL→IL(s′|t) using Bayes' rule:

p_3,IL→RL(t|ss′) = p_21,RL→IL(s|t) p_22,RL→IL(s′|t) p(t) / p(ss′)    (1)

The estimate is based on our assumption from the previous step that p_22(s′|t) is independent of s. For each RL character t we compute the empirical probability p(t) from monolingual data, and p(ss′) is the normalization factor. We set the p_3 parameters using equation (1), but before normalizing we manually prune the probabilities. If for IL characters s and s′ there exists no RL character t such that p_21(s|t) p_22(s′|t) p(t) > 0.01, we assume that ss′ does not map to any RL character. Otherwise, we keep only the RL characters for which p_21(s|t) p_22(s′|t) p(t) > 0.01 and then apply the normalization.
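The Bayes-rule estimate with pruning can be sketched as follows; the reverse-direction tables and all values are hypothetical toys, not trained probabilities:

```python
from collections import defaultdict

def estimate_p3(p21_rev, p22_rev, p_t, threshold=0.01):
    """Estimate p_3(t|ss') ∝ p_21(s|t) * p_22(s'|t) * p(t), pruning scores
    at or below the threshold before normalizing over t for each IL pair."""
    scores = defaultdict(dict)
    for (t1, s), p_s in p21_rev.items():
        for (t2, sp), p_sp in p22_rev.items():
            if t1 != t2 or t1 not in p_t:
                continue
            score = p_s * p_sp * p_t[t1]
            if score > threshold:
                scores[(s, sp)][t1] = score
    # Normalize surviving scores over t for each IL character pair ss'.
    return {pair: {t: v / sum(ts.values()) for t, v in ts.items()}
            for pair, ts in scores.items()}

# Toy: Serbian 'ч' deciphers to Bosnian 'c' followed by 'h'.
p21_rev = {("ч", "c"): 0.9}
p22_rev = {("ч", "h"): 0.8}
p_t = {"ч": 0.05}
p3 = estimate_p3(p21_rev, p22_rev, p_t)
print(p3[("c", "h")])  # {'ч': 1.0}: 0.9*0.8*0.05 = 0.036 survives pruning
```

IL pairs with no surviving RL character simply do not appear in the output table, matching the paper's rule that such pairs map to nothing.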
Step four: In the final step we fix p_1, p_21, p_22, and p_3 to their trained values and train α(s), β(s), and γ(s) for each IL character s.

Decoding
In the decoding step we compose the cipher WFST with an IL word-based language model WFST and find the best path for the input sentence in the resulting WFST (bottom section of Figure 1). If the best path has a high enough score, the model outputs the corresponding IL token. Otherwise it outputs the highest-scoring character sequence produced by the cipher model as an OOV. In our experiments we use 1-gram and 2-gram language models trained on all the available IL monolingual data (Table 1).
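The per-token decision, reduced to a 1-gram LM for clarity, can be sketched as below. The function name, threshold, and all word probabilities are our own illustrative assumptions; the real system scores whole sentences via WFST composition rather than tokens in isolation:

```python
def decode_token(cands, lm_unigram, oov_threshold=1e-4):
    """Pick the best in-vocabulary candidate by cipher * LM score; fall back
    to the cipher model's best spelling as an OOV (threshold is illustrative)."""
    best_word, best_score = None, 0.0
    for word, cipher_prob in cands:
        score = cipher_prob * lm_unigram.get(word, 0.0)
        if score > best_score:
            best_word, best_score = word, score
    if best_score > oov_threshold:
        return best_word
    return max(cands, key=lambda c: c[1])[0]  # OOV: best cipher path

lm = {"wezens": 0.001}
print(decode_token([("wezens", 0.4), ("wesens", 0.5)], lm))  # -> "wezens"
print(decode_token([("menslike", 0.6)], lm))                 # OOV -> "menslike"
```

The second call mirrors the "menslike" example discussed in the experiments: no candidate clears the LM threshold, so the cipher-best spelling passes through unchanged.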
For each language, we download monolingual data from the Leipzig Corpora Collection (Goldhahn et al., 2012). The domain of the data is news, web, and Wikipedia. We treat the language with more data as RL and the one with less data as IL. Table 1 shows the size of the available data for each language.
We also extract the alphabet of each language from Wikipedia, and collect the Universal Declaration of Human Rights (UDHR) for each IL and RL. We manually sentence-align these documents, obtaining 104 sentences and about 1.5K tokens per language. We use these documents to test conversion accuracy.
We tokenize and lowercase all the monolingual, parallel, and UDHR data with the Moses scripts. We remove all non-alphabetic characters from each text according to the alphabet extracted from Wikipedia. This includes numbers, punctuation, and rare or archaic characters that are not considered official characters of the language. We keep all accented variants of characters.
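The character-filtering step can be sketched as a small function; the sample alphabet is an illustrative subset (in the paper it is extracted from Wikipedia):

```python
def preprocess(text, alphabet):
    """Lowercase and keep only characters in the language's alphabet
    (plus space); accented letters survive if they are in the alphabet."""
    text = text.lower()
    return "".join(c for c in text if c in alphabet or c == " ")

# Illustrative alphabet subset with an accented character.
alphabet = set("abcdefghijklmnopqrstuvwxyzç")
print(preprocess("Justiça, 2016!", alphabet))  # -> "justiça "
```

Digits and punctuation disappear, while 'ç' is kept because it is listed in the alphabet.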

Experiments
We translate the UDHR between the related languages using the following methods:

Copy: copying the text. This is not applicable for languages with different orthographies.

LS: a one-to-one Letter Substitution cipher. This is equivalent to using WFST1 without a decoding language model.

LS+1g LM: the one-to-one letter substitution cipher with a 1-gram word language model at decoding.

PM+1g LM, PM+2g LM: the Proposed Method with a 1-gram and a 2-gram word language model at decoding, respectively.
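The LS baseline amounts to applying the argmax one-to-one character mapping to the text; a minimal sketch, with a hypothetical Bosnian-Latin-to-Serbian-Cyrillic toy table:

```python
def letter_substitution(text, table):
    """LS baseline: apply the argmax one-to-one character mapping,
    passing through spaces and unmapped characters unchanged."""
    best = {s: max(t_probs, key=t_probs.get) for s, t_probs in table.items()}
    return "".join(best.get(c, c) for c in text)

# Hypothetical substitution table (Latin -> Cyrillic).
table = {"v": {"в": 1.0}, "o": {"о": 0.9, "а": 0.1},
         "d": {"д": 1.0}, "a": {"а": 1.0}}
print(letter_substitution("voda", table))  # -> "вода"
```

Unlike the proposed method, this baseline cannot model one-to-two or two-to-one correspondences and has no language model to arbitrate between competing mappings.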
Results are reported for both directions of translation in Tables 2 and 3. For all language pairs except Malaysian (mal) / Indonesian (ind), the proposed method is the best model by a large margin. Malaysian/Indonesian is a special case: although the two languages have different vocabularies and slightly different grammar, they share a common alphabet, and almost all of their cognates are spelled identically (see Figure 3 for an example). As a result, the proposed method cannot learn much more than copying.

afr:    alle menslike wesens word vry met gelyke --waardigheid en regte
a2d:    alle menslike wezens werd vrij met gelijke --waardigheid en rechte
dut:    alle mensen ------ worden vrij en gelijk in waardigheid en rechten
afr2en: all human beings are free with equal --dignity and rights
a2d2en: all human beings were free with equal --dignity and straight
dut2en: all people ------ are free and equal in dignity and rights

Figure 4: First sentence of the first article of the UDHR in Afrikaans (afr) and Dutch (dut), and its conversion from Afrikaans to Dutch using PM+2-gram LM (a2d), along with their translations into English.
The proposed method translates between Serbian (srb) and Bosnian (bos) almost perfectly. For the other pairs, we translate between a quarter and half of the words correctly, but we get few higher-order n-gram matches. Figure 4 shows the conversion of the first sentence of the first article of the UDHR from Afrikaans (afr) to Dutch (dut) using PM+2g LM (4.3 BLEU4, 36.2 BLEU1). Observe that 4 out of 10 tokens are translated correctly, close to the 36.2 BLEU1 score, and there is no 3-gram or 4-gram match. For all other tokens except "menslike", the translation is either correct but absent from the Dutch reference sentence (wezens = beings, met = with) or similar enough in meaning to be useful in downstream applications (werd = were vs. worden = are, gelijke = equal (noun) vs. gelijk = equal (adjective), rechte = straight/right vs. rechten = rights). The token "menslike" in a2d is an OOV: the model is unable to convert "menslike" (afr) to "mensen" (dut), the language model does not accept any of the other candidate conversions, and so the model passes through "menslike", the best output of the cipher model, unchanged.

Conclusion
In this paper we present a method for translating texts between closely related languages with potentially different orthographies, without needing any parallel data. The only requirements are a few thousand lines of monolingual data for each language and a word-based language model for the target language. Our experiments on six language pairs show that the proposed method outperforms other methods that do not use parallel data.