Data representation methods and use of mined corpora for Indian language transliteration

Our NEWS 2015 shared task submission is a PB-SMT based transliteration system with two corpus preprocessing enhancements: (i) the addition of word-boundary markers, and (ii) language-independent, overlapping character segmentation. We show that the addition of word-boundary markers improves transliteration accuracy substantially, while the overlapping segmentation shows promise in our preliminary analysis. We also compare transliteration systems trained on manually created corpora with systems trained on corpora mined from parallel translation corpora for English to Indian language pairs. Finally, we identify the major errors in English to Indian language transliteration by analyzing heat maps of character-level confusion matrices.


Introduction
Machine transliteration can be viewed as the problem of transforming a sequence of characters in one alphabet into a sequence in another, and hence as a special case of the general translation problem between two languages. The primary differences from general translation are: (i) a limited vocabulary size, and (ii) a simpler grammar with no reordering. Phrase-based statistical machine translation (PB-SMT) is a robust and well-understood technology that can easily be adapted to the transliteration problem (Noeman, 2009; Finch and Sumita, 2010). Our submission to the NEWS 2015 shared task is a PB-SMT system. Over a baseline PB-SMT system, we address two issues: (i) a suitable data representation for training, and (ii) the availability of parallel transliteration corpora.
In many writing systems, the same logical/phonetic symbol can have different character representations depending on whether it occurs in the initial, medial or terminal position of a word. For instance, Indian scripts have different characters for independent vowels and vowel diacritics: independent vowels typically occur at the beginning of a word, while diacritics occur in medial and terminal positions. The pronunciation, and hence the transliteration, can also depend on the position of the characters. For instance, the terminal ion in nation is pronounced differently from the initial one in ionize. PB-SMT learning of character sequence mappings is agnostic of the position of a character in the word. Hence, we explore transforming the data representation to encode position information. Zhang et al. (2012) did not report any benefit from such a representation for Chinese-English transliteration; we investigate whether such an encoding is useful for alphabetic and consonantal scripts, as opposed to logographic scripts like Chinese.
It is generally believed that syllabification of the text helps improve transliteration systems. However, syllabification systems are not available for all languages. Tiedemann (2012) proposed a character-level, overlapping bigram representation in the context of machine translation using transliteration, which can be viewed as a weak, coarse, language-independent approximation of syllabification. We explore this overlapping segmentation approach for the transliteration task.
For many language pairs, parallel transliteration corpora are not publicly available. However, parallel translation corpora like Europarl (Koehn, 2005) and ILCI (Jha, 2012) are available for many language pairs, and transliteration corpora mined from them have been shown to be useful for machine translation, cross-lingual information retrieval, etc. (Kunchukuttan et al., 2014). In this paper, we make an intrinsic evaluation of the performance of the automatically mined BrahmiNet transliteration corpus (Kunchukuttan et al., 2015) for transliteration between English and Indian languages. The BrahmiNet corpus contains transliteration corpora for 110 Indian language pairs mined from the ILCI corpus, a parallel translation corpus of 11 Indian languages (Jha, 2012).
The rest of the paper is organized as follows. Section 2 and Section 3 describe our system and experimental setup, respectively. Section 4 discusses the results for the various data representation methods and for the use of mined corpora. Section 5 concludes the report.

System Description
We use a standard PB-SMT model for transliteration between the various language pairs. It is a discriminative, log-linear model using the standard SMT features, viz. direct/inverse phrase translation probabilities, direct/inverse lexical translation probabilities, phrase penalty, word penalty and language model score. The feature weights are tuned to optimize BLEU (Papineni et al., 2002) using the Minimum Error Rate Training algorithm (Och, 2003); it would be worth exploring direct optimization of metrics like accuracy or edit distance instead of using BLEU as a proxy for them. We experiment with various transliteration units, as discussed in Section 2.1. We use a 5-gram language model over the transliteration units, estimated using Witten-Bell smoothing. Since transliteration does not require any reordering, we use monotone decoding.
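To make the smoothing scheme concrete, the following sketch implements Witten-Bell interpolation for a character bigram model. This is a toy stand-in for the 5-gram SRILM model we actually use; the class and method names are ours, not SRILM's, and the lowest-order distribution backs off to a uniform distribution over the observed symbol set.

```python
from collections import Counter, defaultdict

class WittenBellBigramLM:
    """Character bigram LM with Witten-Bell interpolation (toy sketch)."""

    def __init__(self, words):
        self.uni = Counter()
        self.bi = defaultdict(Counter)
        for w in words:
            # word-boundary symbols, as in our marker representation
            chars = ['^'] + list(w) + ['$']
            self.uni.update(chars)
            for a, b in zip(chars, chars[1:]):
                self.bi[a][b] += 1
        self.space = set(self.uni)  # all observed symbols

    def p_unigram(self, c):
        n = sum(self.uni.values())
        t = len(self.uni)           # number of distinct symbol types seen
        lam = n / (n + t)           # Witten-Bell: reserve t/(n+t) for novel events
        return lam * self.uni[c] / n + (1 - lam) * (1 / len(self.space))

    def p(self, c, prev):
        follow = self.bi[prev]
        n = sum(follow.values())
        t = len(follow)             # distinct continuations observed after prev
        if n == 0:
            return self.p_unigram(c)
        lam = n / (n + t)
        return lam * follow[c] / n + (1 - lam) * self.p_unigram(c)
```

By construction, both the unigram and the conditional bigram distributions sum to one over the observed symbol set, which is the property a decoder relies on.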

Data Representation
We create different transliteration models based on different basic transliteration units in the source and target training corpora. We use character (P) as well as overlapping bigram (T) representations. In the character-based system, the character is the basic unit of transliteration; in the bigram-based system, the overlapping bigram is the basic unit. We also augment the word representation with word-boundary markers (M): ^ for the start of a word and $ for the end. These abbreviations are used subsequently to refer to the data representations.
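For illustration, the representations can be generated with a few lines of code. This is a minimal sketch; the function name and keyword arguments are ours, using the ^ and $ markers described above.

```python
def represent(word, bigrams=False, markers=False):
    """Convert a word into space-separated transliteration units.

    P  : characters                    n a t i o n
    M  : characters + boundary markers ^ n a t i o n $
    T  : overlapping bigrams           na at ti io on
    TM : bigrams + boundary markers    ^n na at ti io on n$
    """
    chars = list(word)
    if markers:
        chars = ['^'] + chars + ['$']
    if bigrams:
        # overlapping bigrams: each character (except the last) starts a unit
        units = [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
    else:
        units = chars
    return ' '.join(units)
```

For example, the word nation becomes "n a t i o n" under P and "^n na at ti io on n$" under the combined TM representation; with markers, even unigram-like units become position-sensitive at the word edges.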

Use of mined transliteration corpus
We explore the use of transliteration corpora mined from translation corpora. Sajjad et al. (2012) proposed an unsupervised method for mining transliteration pairs from parallel corpora. Their approach models parallel translation corpus generation as a generative process that interpolates a transliteration process and a non-transliteration process. The parameters of the generative process are learnt using the EM algorithm, after which transliteration pairs are extracted from the parallel corpus by setting an appropriate threshold. We compare the quality of transliteration systems built from such mined corpora with systems trained on the manually created NEWS 2015 corpora for English-Indian language pairs.
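The mining procedure can be caricatured as follows. This sketch keeps the two sub-model probabilities fixed and learns only the interpolation weight by EM, whereas Sajjad et al. also re-estimate the sub-model parameters inside EM; all names, the toy scoring functions, and the threshold value are hypothetical.

```python
def em_mixture(pairs, p_tr, p_ntr, iters=50, lam=0.5):
    """EM for the weight of a transliteration/non-transliteration mixture.

    pairs : list of (e, f) word pairs from a parallel corpus
    p_tr  : probability of a pair under the transliteration sub-model
    p_ntr : probability of a pair under the non-transliteration sub-model
    """
    post = []
    for _ in range(iters):
        # E-step: posterior probability that each pair is a transliteration
        post = [lam * p_tr(e, f) /
                (lam * p_tr(e, f) + (1 - lam) * p_ntr(e, f))
                for e, f in pairs]
        # M-step: new interpolation weight is the mean posterior
        lam = sum(post) / len(post)
    return lam, post

def mine(pairs, p_tr, p_ntr, threshold=0.5):
    """Extract pairs whose posterior exceeds the chosen threshold."""
    lam, post = em_mixture(pairs, p_tr, p_ntr)
    return [p for p, q in zip(pairs, post) if q >= threshold]
```

Raising the threshold trades recall for precision, which is the knob referred to later when we experiment with stricter selection thresholds.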

Experimental Setup
For building the transliteration models with both the NEWS 2015 shared task corpus and the BrahmiNet corpus, we used 500 word pairs for tuning and the rest for SMT training. The experimental results are reported on the NEWS 2015 development sets in both cases. The details of the NEWS 2015 shared task datasets are given in the shared task report, while the sizes of the BrahmiNet datasets are listed below:

Src  Tgt  Size
En   Hi   10513
En   Ba   7567
En   Ta   3549

We use the Moses toolkit (Koehn et al., 2007) to train the transliteration systems, and the language models are estimated using the SRILM toolkit (Stolcke, 2002). The transliteration pairs are mined using the transliteration module in Moses (Durrani et al., 2014).

Table 1 shows transliteration results for the various data representation methods on the development set. We see improvements in transliteration accuracy of up to 18% due to the use of word-boundary markers; the MRR also shows an improvement of up to 15%. We also identified the major errors in English to Indian language transliteration using heat maps of the character-level confusion matrices (Figure 1 shows the one for En-Hi). The following errors are common across all English-Indian language pairs in the shared task: (i) incorrect generation of vowel diacritics, especially confusion between long and short vowels; (ii) schwa deletion; (iii) confusion between dental and retroflex consonants; and (iv) incorrect or spurious generation of the halanta (inherent vowel suppressor) character as well as the aakar maatra (vowel diacritic for आ (aa)). Hi and Ba show confusion between the sibilants (स, श, ष), while Ta and Ka exhibit incorrect or spurious generation of य (ya).
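The character-level confusion counts behind such heat maps can be obtained by aligning each system output with its reference, for instance via a Levenshtein backtrace. This is a sketch of one plausible procedure; the paper does not specify the exact alignment used, and the function names are ours.

```python
from collections import Counter

def align(ref, hyp):
    """Character alignment via Levenshtein backtrace.

    Substitutions become (ref_char, hyp_char) pairs; insertions and
    deletions are paired with the empty string.
    """
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match/substitution
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append((ref[i - 1], '')); i -= 1
        else:
            ops.append(('', hyp[j - 1])); j -= 1
    return ops[::-1]

def confusion(pairs):
    """Accumulate mismatched character pairs over (reference, hypothesis) words."""
    c = Counter()
    for ref, hyp in pairs:
        for r, h in align(ref, hyp):
            if r != h:
                c[(r, h)] += 1
    return c
```

Plotting the resulting counts as a matrix (reference characters on one axis, system characters on the other) yields heat maps of the kind shown in Figure 1.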

Effect of Data Representation methods
However, the use of an overlapping bigram representation does not show any significant improvement over the baseline output. The above results are for systems tuned to maximize BLEU, which does not seem to be the most suitable tuning metric for the bigram representation. Hence, we compare the untuned outputs (shown in Table 2 for a few language pairs). As anticipated, the bigram representation gives a significant improvement in accuracy (about 25% on average), and the combination of word-boundary markers and the bigram representation performs best. This suggests the need to tune the SMT system with an alternative metric like edit distance so that the benefit of the bigram representation can be properly harnessed. We also found examples where the bigram representation resulted in the correct generation of consonants on which the character representation made errors.

The mined corpus results are weakest for Ta, where the quality of the mined corpus suffers on account of the presence of suffixes due to the agglutinative nature of the language. This results in some wrongly mined pairs as well as a smaller number of mined word pairs. The F-score does not suffer as much as top-1 accuracy, and all languages have an F-score greater than 70%. The MRR suggests that the correct transliteration can be found in the top-3 candidates for Hi and Ba, and in the top-7 candidates for Ta. Thus, though the top-1 accuracy of a system trained on mined corpora is lower than that of one trained on a manually created corpus, the top-k candidates can be useful in downstream applications like machine translation and cross-lingual IR. Since the NEWS 2015 corpus is larger than the BrahmiNet corpus, we also train on a random subset of the NEWS 2015 corpus of the same size as the BrahmiNet corpus. In addition, we experiment with stricter selection thresholds in the mining process.
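The evaluation measures referred to here can be computed from the system's n-best candidate lists. This is a simplified sketch assuming a single reference per word; the official NEWS metrics also handle multiple references, and the LCS-based F-score is omitted.

```python
def topk_accuracy(nbest, refs, k=1):
    """Fraction of words whose reference appears among the top-k candidates."""
    hits = sum(1 for cands, ref in zip(nbest, refs) if ref in cands[:k])
    return hits / len(refs)

def mrr(nbest, refs):
    """Mean reciprocal rank of the reference in each candidate list (0 if absent)."""
    total = 0.0
    for cands, ref in zip(nbest, refs):
        if ref in cands:
            total += 1.0 / (cands.index(ref) + 1)
    return total / len(refs)
```

An MRR near 1/3, for instance, is consistent with the correct transliteration typically appearing around rank 3, which is the reading used above for Hi and Ba.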

Transliteration using an automatically mined corpus
Since the NEWS 2015 development corpus is quite similar to the NEWS training corpus, we use another corpus (Gupta et al., 2012) to evaluate both systems. In all these cases, the NEWS-trained system gives higher accuracy than the BrahmiNet-trained one. To explain the superiority of the NEWS corpus across all configurations, we computed the average entropy of the conditional transliteration probability (Chinnakotla et al., 2010). The average entropy of the P(En|Hi) distribution at the character level is higher for the BrahmiNet corpus (0.8) than for the NEWS 2015 corpus (0.574). The same holds for the P(Hi|En) distribution. This indicates a higher ambiguity in selecting transliterations in the BrahmiNet corpus.
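The average entropy measure can be sketched as follows, assuming character-level pair counts from an aligned corpus. This takes an unweighted average over source characters, which is one plausible reading; Chinnakotla et al. give the exact formulation.

```python
import math
from collections import defaultdict

def average_entropy(pair_counts):
    """Average entropy of P(target | source) over source characters.

    pair_counts maps (source_char, target_char) to its count in the
    character-aligned transliteration corpus.
    """
    by_src = defaultdict(dict)
    for (s, t), c in pair_counts.items():
        by_src[s][t] = c
    entropies = []
    for s, dist in by_src.items():
        n = sum(dist.values())
        # entropy (in bits) of the conditional distribution for this source char
        h = -sum((c / n) * math.log2(c / n) for c in dist.values())
        entropies.append(h)
    return sum(entropies) / len(entropies)
```

A deterministic character mapping gives entropy 0, while a corpus in which each source character maps to many targets with similar frequency gives a high value, matching the interpretation of 0.8 vs. 0.574 above.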

Conclusion
We addressed data representation and corpus availability issues in PB-SMT based transliteration, with a special focus on English-Indian language pairs. We showed that adding boundary markers to the word representation significantly improves transliteration accuracy. We also noted that an overlapping character segmentation can be useful, subject to tuning with an evaluation metric appropriate for transliteration. Finally, we showed that though automatically mined corpora provide lower top-1 transliteration accuracy, the top-10 accuracy, MRR and F-score are competitive enough to justify the use of the top-k candidates from these mined corpora in translation and IR systems.