Orthographic Syllable as basic unit for SMT between Related Languages

We explore the use of the orthographic syllable, a variable-length consonant-vowel sequence, as a basic unit of translation between related languages which use abugida or alphabetic scripts. We show that orthographic syllable level translation significantly outperforms models trained over other basic units (word, morpheme and character) when training over small parallel corpora.


Introduction
Related languages exhibit lexical and structural similarities on account of sharing a common ancestry (Indo-Aryan, Slavic languages) or being in prolonged contact for a long period of time (Indian subcontinent, Standard Average European linguistic areas) (Bhattacharyya et al., 2016). Translation between related languages is an important requirement due to substantial government, business and social communication among people speaking these languages. However, most of these languages have few parallel corpora resources, an important requirement for building good quality SMT systems.
Modelling the lexical similarity among related languages is the key to building good-quality SMT systems with limited parallel corpora. Lexical similarity implies that the languages share many words with the similar form (spelling/pronunciation) and meaning e.g. blindness is andhapana in Hindi, aandhaLepaNaa in Marathi. These words could be cognates, lateral borrowings or loan words from other languages. Translation for such words can be achieved by sub-word level transformations. For instance, lexical similarity can be modelled in the standard SMT pipeline by transliteration of words while decoding (Durrani et al., 2010) or post-processing (Nakov and Tiedemann, 2012;Kunchukuttan et al., 2014).
A different paradigm is to drop the notion of word boundary and consider the character n-gram as the basic unit of translation (Vilar et al., 2007;Tiedemann, 2009a). Such character-level SMT bas been explored for closely related languages like Bulgarian-Macedonian, Indonesian-Malay with modest success, with the short context of unigrams being a limiting factor (Tiedemann, 2012). The use of character n-gram units to address this limitation leads to data sparsity for higher order n-grams and provides little benefit (Tiedemann and Nakov, 2013).
In this work, we present a linguistically motivated, variable length unit of translationorthographic syllable (OS) -which provides more context for translation while limiting the number of basic units. The OS consists of one or more consonants followed by a vowel and is inspired from the akshara, a consonant-vowel unit, which is the fundamental organizing principle of Indic scripts (Sproat, 2003;Singh, 2006). It can be thought of as an approximate syllable with the onset and nucleus, but no coda. While true syllabification is hard, orthographic syllabification can be easily done. Atreya et al. (2016) and Ekbal et al. (2006) have shown that the OS is a useful unit for transliteration involving Indian languages.
We show that orthographic syllable-level trans-lation significantly outperforms character-level and strong word-level and morpheme-level baselines over multiple related language pairs (Indian as well as others). Character-level approaches have been previously shown to work well for language pairs with high lexical similarity. Our major finding is that OS-level translation outperforms other approaches even when the language pairs have relatively less lexical similarity or belong to different language families (but have sufficient contact relation).

Orthographic Syllabification
The orthographic syllable is a sequence of one or more consonants followed by a vowel, i.e a C + V unit. We describe briefly procedures for orthographic syllabification of Indian scripts and non-Indic alphabetic scripts. Orthographic syllabification cannot be done for languages using logographic and abjad scripts as these scripts do not have vowels.
Indic Scripts: Indic scripts are abugida scripts, consisting of consonant-vowel sequences, with a consonant core (C + ) and a dependent vowel (matra). If no vowel follows a consonant, an implicit schwa vowel [IPA: ə] is assumed. Suppression of schwa is indicated by the halanta character following a consonant. This script design makes for a straightforward syllabification process as shown in the following example. e.g. लक्षमी ( lakShamI . There are two exceptions to this scheme: (i) Indic scripts distinguish between dependent vowels (vowel diacritics) and independent vowels, and the latter will constitute an OS on its own. e.g. मु म्बई (mumbaI) → मु म्ब ई (mu mba I) (ii) The characters anusvaara and chandrabindu are part of the OS to the left if they represents nasalization of the vowel/consonant or start a new OS if they represent a nasal consonant. Their exact role is determined by the character following the anusvaara. See Appendix A for details.

Non-Indic Alphabetic Scripts:
We use a simpler method for the alphabetic scripts used in our experiments (Latin and Cyrillic). The OS is identified by a C + V + sequence. e.g. lakshami→la ksha mi, mum-bai→mu mbai. The OS could contains multiple terminal vowel characters representing long vowels (oo in cool) or diphthongs (ai in mumbai). A vowel start-

Translation Models
We compared the orthographic syllable level model (O) with models based on other translation units that have been reported in previous work: word (W), morpheme (M), unigram (C) and trigram characters. Table 1 shows examples of these representations.
The first step to build these translation systems is to transform sentences to the correct representation. Each word is segmented as the per the unit of representation, punctuations are retained and a special word boundary marker character (_) is introduced to indicate word boundaries as shown here: For all units of representation, we trained phrasebased SMT (PBSMT) systems. Since related languages have similar word order, we used distance based distortion model and monotonic decoding. For character and orthographic syllable level models, we use higher order (10-gram) languages models since data sparsity is a lesser concern due to small vocabulary size (Vilar et al., 2007). As suggested by Nakov and Tiedemann (2012), we used word-level tuning for character and orthographic syllable level models by post-processing n-best lists in each tuning step to calculate the usual word-based BLEU score.
While decoding, the word and morpheme level systems will not be able to translate OOV words. Since the languages involved share vocabulary, we transliterate the untranslated words resulting in the post-edited systems W X and M X corresponding to the systems W and M respectively. Following decoding, we used a simple method to regenerate words from sub-word level units: Since we represent word boundaries using a word boundary marker, we

Experimental Setup
Languages: Our experiments primarily concentrated on multiple language pairs from the two major language families of the Indian sub-continent (Indo-Aryan branch of Indo-European and Dravidian). These languages have been in contact for a long time, hence there are many lexical and grammatical similarities among them, leading to the subcontinent being considered a linguistic area (Emeneau, 1956). Specifically, there is overlap between the vocabulary of these languages to varying degrees due to cognates, language contact and loanwords from Sanskrit (throughout history) and English (in recent times). Table 2 lists the languages involved in the experiments and provides an indication of the lexical similarity between them in terms of the Longest Common Subsequence Ratio (LCSR) (Melamed, 1995) between the parallel training sentences at character level. All these language have a rich inflectional morphology with Dravidian languages, and Marathi and Konkani to some degree, being agglutinative. kok-mar and pan-hin have a high degree of lexical similarity. System details: PBSMT systems were trained using the Moses system (Koehn et al., 2007), with the grow-diag-final-and heuristic for extracting phrases, and Batch MIRA (Cherry and Foster, 2012) for tuning (default parameters). We trained 5-gram LMs with Kneser-Ney smoothing for word and morpheme level models and 10-gram LMs for character and OS level models. We used the BrahmiNet transliteration system (Kunchukuttan et al., 2015) for postediting, which is based on the transliteration Module in Moses (Durrani et al., 2014). We used unsupervised morphological segmenters trained with Morfessor (Virpioja et al., 2013) for obtaining morpheme representations. The unsupervised morphological segmenters were trained on the ILCI corpus and the Leipzig corpus (Quasthoff et al., 2006).The morph-segmenters and our implementation of orthographic syllabification are made available as part of the Indic NLP Library 1 .

Evaluation:
We use BLEU (Papineni et al., 2002) and Le-BLEU (Virpioja and Grönroos, 2015) for evaluation. Le-BLEU does fuzzy matches of words and hence is suitable for evaluating SMT systems that perform transformation at the sub-word level.

Results and Discussion
This section discusses the results on Indian and non-Indian languages and cross-domain translation. Table 3 compares the BLEU scores for various translation systems. The orthographic syllable level system is clearly better than all other systems. It significantly outperforms the character-level system (by 46% on an average). The character-based system is competitive only for highly lexically similar language pairs like pan-hin and kok-mar. The system also outperforms two strong baselines which address data sparsity: (a) a word-level system with transliteration of OOV words (10% improvement), (b) a morph-level system with transliteration of OOV words (5% improvement). The OS-level representation is more beneficial when morphologically rich   languages are involved in translation. Significantly, OS-level translation is also the best system for translation between languages of different language families. The Le-BLEU scores also show the same trend as BLEU scores (See Appendix B). There are a very small number of untranslated OSes, which we handled by simple mapping of untranslated characters from source to target script. This barely increased translation accuracy (0.02% increase in BLEU score).

Why is OS better than other units?:
The improved performance of OS level representation can be attributed to the following factors: One, the number of basic translation units is limited and small compared to word-level and  Two, while character level representation too does not suffer from data sparsity, we observe that the translation accuracy is highly correlated to lexical similarity (Table 4). The high correlation of character-level system and lexical similarity explains why character-level translation performs nearly as well other methods for language pairs which have high lexical similarity, but performs badly otherwise. On the other hand, the OSlevel representation has lesser correlation with lexical similarity and sits somewhere between characterlevel and word/morpheme level systems. Hence it is able to make generalizations beyond simple character level mappings. We observed that OS-level representation was able to correctly generate words whose translations are not cognate with the source language. This is an important property since function words and suffixes tend to be less similar lexically across languages.
Can improved translation performance be explained by longer basic translation units? To verify this, we trained translation systems with character trigrams as basic units. We chose trigrams since the average length of the OS was 3-5 characters for the languages we tested with. The translation accuracies were far less than even unigram representation. The number of unique basic units was about 8-10 times larger than orthographic syllables, thus making data sparsity an issue again. So, improved translation performance cannot be attributed to longer n-gram units alone.   and tel-mal language pairs for the system M X (accuracies of the OS-level system are within 10% of the M X system). Since the word leve model depends on coverage of the lexicon, it is highly domain dependent, whereas the sub-word units are not. So, even unigram-level models outperform word-level models in a cross-domain setting. Table 6 shows the corpus statistics and our results for translation between some related non-Indic language pairs (Bulgarian-Macedonian, Danish-Swedish, Malay-Indonesian). OS level representation outperforms character and word level representation, though the gains are not as significant as Indic language pairs. This could be due to short length of sentences in training corpus [OPUS movie subtitles (Tiedemann, 2009b)] and high lexical similarity between the language pairs. Further experiments between less lexically related languages on general parallel corpora will be useful.

Experiments with non-Indian languages:
Effect of training data size: For different training set sizes, we trained SMT systems with various representation units ( Figure 1 shows the learning curves for two language pairs). OS level models are consistently better than character level models even for small training corpora. While the OSlevel models benefit from larger training corpora, the character-level models show only small benefits. For smaller corpora, the OS level model shows even larger performance gains over the corresponding word and morph level models. The performance gap between word/morph-level and OS-level shows a decrease with increase in training data.

Conclusion & Future Work
We focus on the task of translation between related languages. This aspect of MT research is important to make available translation technologies to language pairs with limited parallel corpus, but huge potential translation requirements. We propose the use of the orthographic syllable, a variablelength, linguistically motivated, approximate syllable, as a basic unit for translation between related languages. We show that it significantly outperforms other units of representation, over multiple language pairs, spanning different language families, with varying degrees of lexical similarity and is robust to domain changes too. This opens up the possibility of further exploration of sub-word level translation units e.g. segments learnt using byte pair encoding (Sennrich et al., 2016).