Using Related Languages to Enhance Statistical Language Models

The success of many language modeling methods and applications relies heavily on the amount of data available. This problem is further exacerbated in statistical machine translation, where parallel data in the source and target languages is required. However, large amounts of data are only available for a small number of languages; as a result, many language modeling techniques are inadequate for the vast majority of languages. In this paper, we attempt to lessen the problem of a lack of training data for low-resource languages by adding data from related high-resource languages in three experiments. First, we interpolate language models trained on the target language and on the related language. In our second experiment, we select the sentences most similar to the target language and add them to our training corpus. Finally, we integrate data from the related language into a translation model for a statistical machine translation application. Although we do not see many significant improvements over baselines trained on a small amount of data in the target language, we discuss some further experiments that could be attempted in order to augment language models and translation models with data from related languages.


Introduction
Statistical language modeling methods are an essential part of many language processing applications, including automatic speech recognition (Stolcke, 2002), machine translation (Kirchhoff and Yang, 2005), and information retrieval (Liu and Croft, 2005). However, their success is heavily dependent on the availability of suitably large text resources for training (Chen and Goodman, 1996). Such data can be hard to obtain, especially for low-resource languages. This problem is especially acute when language modeling is used in statistical machine translation, where a lack of parallel resources for a language pair can be a significant detriment to quality.
Our goal is to exploit a high-resource language to improve modeling of a related low-resource language, which is applicable to cases where the target language is closely related to a language with a large amount of text data available. For example, languages that are not represented in the European Parliament, such as Catalan, can be aided by related languages that are, such as Spanish. The data available from the related high-resource language can be adapted in order to add to the translation model or the language model of the target language. This paper is an initial attempt at using minimally transformed data from a related language to enhance language models and increase parallel data for SMT.

Domain Adaptation
This problem can be seen as a special case of domain adaptation, with the in-domain data being the data in the target language and the out-of-domain data being the data in the related language. Domain adaptation is often used to leverage resources for a specific domain, such as biomedical text, from more general domains like newswire data (Dahlmeier and Ng, 2010). This idea can be applied to SMT, where data from the related language can be adapted to look like data from the low-resource language. It has been shown that training on a large amount of adapted text significantly improves results compared to training on a small in-domain corpus or training on unadapted data (Wang et al., 2012). In this paper, we apply two particular domain adaptation approaches. First, we interpolate language models from in-domain and out-of-domain data, following Koehn and Schroeder (2007). We also attempt to select the best out-of-domain data using perplexity, similar to what was done in Gao et al. (2002).

Machine Translation
In contrast to transfer-based and word-based machine translation, the quality of statistical machine translation depends heavily on the amount of parallel resources. Given the difficulty of obtaining sufficient parallel resources, this can be a problem for many language pairs. In those cases, a third language can be used as a pivot. The process of using a third language as a bridge instead of translating directly is called triangulation (Singla et al., 2014). Character-level translation combined with word-level translation has also been shown to improve over phrase-based approaches for closely related languages (Nakov and Tiedemann, 2012). Similarly, transliteration methods using cognate extraction and bilingual dictionaries (Kirschenbaum and Wintner, 2010) can be used to aid the low-resource language.

3 Experimental Framework

Choice of Languages
For the purpose of our experiments, we treat Spanish as if it were a low-resource language and test Spanish language models and English-Spanish translations. We use Italian and Portuguese as the closely related languages. Using these languages for our experiments allows us to compare the results to the language models and machine translations that can be created using large corpora.
Spanish, Portuguese, and Italian all belong to the Romance family of Indo-European languages. Spanish has strong lexical similarity with both Portuguese (89%) and Italian (82%) (Lewis, 2015). Among major Romance languages, Spanish and Portuguese have been found to be the closest pair in automatic corpus comparisons (Ciobanu and Dinu, 2014) and in comprehension studies (Voigt and Gooskens, 2014), followed by Spanish and Italian.

Data
We used the Europarl corpus (Koehn, 2005) for training and testing. In order to use the data in our experiments, we tokenized the corpus, converted all words to lowercase, and collapsed all numerical symbols into one special symbol. Finally, we transliterated the Italian and Portuguese corpora to make them more Spanish-like; this process is described in section 3.3.
The data was split as follows: 10% of the Spanish data (196,221 sentences) was used for testing, 10% for development, and the remaining 80% (1,569,771 sentences) for training. The Italian and Portuguese corpora were split similarly, and the training sizes for the models varied between 30K sentences and the full training sets of 1,523,304 sentences for Italian and 1,566,015 sentences for Portuguese.
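The preprocessing and split described above can be sketched roughly as follows. This is a minimal illustration, not the scripts used in the paper: the placeholder symbol `<num>`, the number pattern, and the positional (non-random) split are all assumptions, and the input is assumed to be tokenized already.

```python
import re

# Hypothetical placeholder; the paper does not name its special symbol.
NUM_TOKEN = "<num>"

# Assumed definition of a numeric token: digits with optional internal
# separators, e.g. 2005, 3.5 or 1,569,771.
_NUMBER = re.compile(r"^\d+(?:[.,]\d+)*$")


def preprocess(tokens):
    """Lowercase every token and collapse numeric tokens into one symbol."""
    return [NUM_TOKEN if _NUMBER.match(tok) else tok.lower() for tok in tokens]


def split_corpus(sentences):
    """Split a corpus 80/10/10 into train, dev and test by position.

    Whether the original split was positional or random is not stated,
    so the positional split here is an assumption.
    """
    n = len(sentences)
    n_held_out = n // 10  # size of the dev set and of the test set
    train_end = n - 2 * n_held_out
    train = sentences[:train_end]
    dev = sentences[train_end:train_end + n_held_out]
    test = sentences[train_end + n_held_out:]
    return train, dev, test
```

Collapsing numbers into one symbol keeps the language model from spending probability mass on an open-ended set of numeric strings.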

Transliteration
In order to use Italian and Portuguese data to model Spanish, we first transliterated the Italian and Portuguese training corpora using a naive rule-based transliteration method consisting of word-level string transformations and a small bilingual dictionary. For the bilingual dictionary, the 200 most common words were extracted from the Italian and the Portuguese training corpora and manually given Spanish translations. In translating to Spanish, an effort was made to keep cognates where possible, and to use the most likely or common meanings. Table 1 gives translations used for the ten most common Italian words in the data. Even in this small sample, there is a problematic translation. The Italian preposition per can be translated to por or para. In keeping with the desire to use a small amount of data, we briefly read the Italian texts to find the translation we felt was more likely (para), and chose that as the translation for all instances of per in the training set. We also verified that para was more likely in the Spanish training text overall than por.
The rule-based component of the transliteration consisted of handwritten word-initial, word-final, and general transformation rules. We applied approximately fifty such rules per language to the data. In order to come up with the rules, we examined the pan-Romance vocabulary list compiled by EuroComRom (Klein, 2002); however, such rules could be derived by an expert with knowledge of the relevant languages with relatively little effort. Character clusters that were impossible in Spanish were converted to their most common correspondence in Spanish (in the word list). We also identified certain strings that had consistent correspondences in Spanish and replaced them appropriately. These rules were applied to all words in the Italian and Portuguese training data except for those that were in the bilingual dictionary. See
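The dictionary-plus-rules procedure described in this section could look roughly like the sketch below. Only the dictionary entry per → para comes from the text; every rule shown is a hypothetical example chosen for illustration and should not be read as one of the roughly fifty rules the authors actually used.

```python
import re

# Bilingual dictionary of frequent words. Only "per" -> "para" is taken
# from the paper; a real dictionary would hold about 200 entries.
DICTIONARY = {"per": "para"}

# Hypothetical Italian-to-Spanish rules, for illustration only.
WORD_INITIAL = [(r"^sp", "esp")]           # e.g. sp- -> esp-
WORD_FINAL = [(r"tà$", "dad")]             # e.g. -tà -> -dad
GENERAL = [(r"ff", "f"), (r"tt", "t")]     # collapse some double consonants


def transliterate_word(word):
    """Use the dictionary if the word is listed, otherwise apply the rules."""
    if word in DICTIONARY:
        return DICTIONARY[word]
    for pattern, replacement in WORD_INITIAL + WORD_FINAL + GENERAL:
        word = re.sub(pattern, replacement, word)
    return word


def transliterate(sentence):
    """Transliterate a tokenized, lowercased sentence word by word."""
    return " ".join(transliterate_word(w) for w in sentence.split())
```

Checking the dictionary first means that frequent function words, which tend to be irregular, bypass the string rules entirely.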

Transliteration into Spanish (example output): La dificoldad de conciliar estos obietivos risiede en el hecho que las logique de estos setores son contraditorie.

Experiment 1: Language Model Interpolation
Our first experiment attempted to use language models trained on the transliterated data to increase the coverage of a language model based on Spanish data; this was modeled after Koehn and Schroeder (2007). The language models in this experiment were trigram models with Good-Turing smoothing built using SRILM (Stolcke, 2002). As baselines, we trained Spanish (es) LMs on a small amount (30K sentences) and a large amount (1.5M sentences) of data. We also trained language models based on 30K transliterated and standard Italian (it) and Portuguese (pt) sentences. All were tested on the Spanish test set.
Table 4 shows the perplexity for each of the baselines. As expected, more Spanish training data led to a lower perplexity. However, the transliterated Italian and Portuguese baselines yielded better perplexity with less data. Note also the strong effect of transliteration.
In the experiment, we interpolated LMs trained on different amounts of transliterated data with the LM trained on 30K Spanish sentences. We used SRILM's compute-best-mix tool to determine the interpolation weights of the models. These weights were tuned on the Spanish development set. Table 5 shows the results for the interpolation of the Spanish LM with Italian and Portuguese, both separately and simultaneously. The lambda values are the weights given to each of the language models. None of the interpolated combinations improves on the perplexity of the smallest Spanish baseline. The best results for interpolated language models are achieved when combining the 30K-sentence Spanish model with the 1.5M-sentence Portuguese model, which almost reaches the perplexity level of the 30K-sentence Spanish-only model. As a comparison, we also interpolated two separate language models, each trained on 30K Spanish sentences; the weight for these models was close to 0.5.
In the best-performing language model mix that used all three languages, Portuguese was weighted with a lambda of about 0.17, whereas Italian was only weighted with 0.016. That shows that Portuguese, in this setup, is a better model of Spanish.
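To make the role of the lambda weights concrete, the sketch below estimates linear interpolation weights on held-out data with expectation-maximization and computes the perplexity of the resulting mixture. It is an illustration of the general technique, not SRILM's implementation; the function names are hypothetical, and each model is assumed to supply a nonzero (smoothed) probability for every development-set token.

```python
import math


def estimate_mixture_weights(prob_streams, iterations=50):
    """Estimate linear-interpolation weights with EM.

    prob_streams[k][i] is the probability that model k assigns to the
    i-th token of the development set, given that token's history.
    """
    num_models = len(prob_streams)
    num_tokens = len(prob_streams[0])
    weights = [1.0 / num_models] * num_models
    for _ in range(iterations):
        totals = [0.0] * num_models
        for i in range(num_tokens):
            mix = sum(weights[k] * prob_streams[k][i] for k in range(num_models))
            for k in range(num_models):
                # Posterior probability that model k generated token i.
                totals[k] += weights[k] * prob_streams[k][i] / mix
        weights = [t / num_tokens for t in totals]
    return weights


def perplexity(prob_streams, weights):
    """Perplexity of the interpolated model on the same token stream."""
    num_tokens = len(prob_streams[0])
    log_sum = 0.0
    for i in range(num_tokens):
        mix = sum(w * stream[i] for w, stream in zip(weights, prob_streams))
        log_sum += math.log(mix)
    return math.exp(-log_sum / num_tokens)
```

A model that rarely assigns the higher probability to a development token ends up with a small weight, which is the pattern reported above for Italian (about 0.016) relative to Portuguese (about 0.17).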
An open question concerns the performance of the Portuguese language model in the experiment compared to the baselines. In Table 4, we see that the language model does significantly worse when trained on more Portuguese data. However, the interpolation of the Spanish and Portuguese language models yields a lower perplexity when trained on a large amount of Portuguese data. Since the data was identical in the baselines and experiments, further exploration is needed to understand this behavior.

Experiment 2: Corpus Selection
For our second experiment, our goal was to select the most "Spanish-like" data from our Italian and Portuguese corpora. We concatenated this data with the Spanish sentences in order to increase the amount of training data for the language model. This is similar to what was done by Gao et al. (2002).
First, we trained a language model on our small Spanish corpus. This language model was then queried on a concatenation of the transliterated Italian and Portuguese data. The sentences in this corpus were ranked according to their perplexity in the Spanish LM. We selected the best 30K and 5K sentences, which were then concatenated with the Spanish data to form a larger corpus. Finally, we used KenLM (Heafield, 2011) to create a trigram language model with Kneser-Ney smoothing (Kneser and Ney, 1995) on that data. We also ran the same experiment on Italian and Portuguese separately.
Table 6 gives the results from these experiments. This table shows that the mixed-language models for each language performed better when they had a smaller amount of non-Spanish data. This indicates that it is better to simply use a small amount of data in the low-resource language, rather than trying to augment it with the transliterated data from related languages. Using a smaller amount of the Spanish data, having a different strategy for selecting the non-Spanish data, using a different transliteration method, or using Italian and Portuguese data that was not a direct translation of the Spanish data might each have led to improvements.
It is also interesting to note that the language models based on the corpus containing only Portuguese performed almost as well as those based on the corpus containing Portuguese and Italian. This indicates that the Portuguese data likely had more Spanish-like sentences than the Italian data. As mentioned in section 3.1, Portuguese is more similar to Spanish, so this makes intuitive sense. However, it is surprising given the results in Table 4, which shows that the Italian-only language models performed better on Spanish data than the Portuguese-only language models.
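The selection step of this experiment can be sketched as follows. To keep the example self-contained, an add-one smoothed unigram model stands in for the trigram Spanish language model used in the paper; the class and function names are hypothetical.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model, a stand-in for the trigram LM."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen words

    def perplexity(self, sentence):
        tokens = sentence.split()
        log_prob = sum(
            math.log((self.counts[tok] + 1) / (self.total + self.vocab_size))
            for tok in tokens
        )
        return math.exp(-log_prob / len(tokens))


def select_most_similar(lm, candidates, k):
    """Return the k non-empty candidate sentences with the lowest perplexity."""
    non_empty = [s for s in candidates if s.split()]
    return sorted(non_empty, key=lm.perplexity)[:k]


# Toy usage: rank transliterated sentences against a tiny Spanish corpus,
# then append the selected ones to the Spanish training data.
spanish = ["la casa es grande", "la comisión es importante"]
transliterated = ["la casa es importante", "xyz qwe rty uio"]
selected = select_most_similar(UnigramLM(spanish), transliterated, k=1)
augmented_corpus = spanish + selected
```

Because perplexity is normalized by sentence length, ranking by it does not simply favor short sentences, although very short sentences can still score well by chance.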

Experiment 3: Statistical Machine Translation
Lastly, we experimented with translation models in order to see if our approach yielded similar results. For our baseline, we used a small parallel corpus of 30K English-Spanish (en-es) sentences from the Europarl corpus (Koehn, 2005). The data was preprocessed as described in section 3.2. Since SMT systems are often trained on large amounts of data, we expected poor coverage with this dataset. However, this size would be representative of the amount of data available for low-resource languages. We used Moses to train our phrase-based SMT system on the above-mentioned parallel corpus (en-es). We also trained a language model on 5M words of Spanish data from the same source, making sure that this data was strictly distinct from our parallel data. The language model was trained using KenLM (Heafield, 2011). Weights were set by optimizing BLEU using MERT on a separate development set of 2,000 sentences (English-Spanish). After decoding, we detokenized and evaluated the output. For the evaluation, we used a clean Spanish test set of 2,000 sentences from the same source, with BLEU (Papineni et al., 2002) as the automatic evaluation measure.
For our experiments, we used Italian and Portuguese as auxiliary languages. We created two corpora of 30K sentences each from the Europarl corpus, en-it and en-pt. We first tokenized and transliterated the training corpus of the related language as described in section 3.3. Then, we concatenated the resulting corpora with our baseline corpus and trained our model. This is similar to what was done by , although we attempt to translate into the low-resource language. We first experimented with each auxiliary language independently and then with both languages. In total we conducted the following experiments:
• English-Spanish (en-es) + English-Italian transliterated (en-es_it)
• English-Spanish (en-es) + English-Portuguese transliterated (en-es_pt)
• English-Spanish (en-es) + English-Italian transliterated (en-es_it) + English-Portuguese transliterated (en-es_pt)
In this experiment, we expected to observe some improvements compared to the language modeling experiments, as the mistakes in the transliterated output could be filtered out by the language model containing clean Spanish data. Moreover, we examined whether it is possible to have gains from using multiple related languages simultaneously.

Table 7
Languages                      Sentences          BLEU     p-value
en-es (Baseline)               30K                0.3360
en-es + en-es_it               30K + 30K          0.3357   0.22
en-es + en-es_pt               30K + 30K          0.3349   0.08
en-es + en-es_it + en-es_pt    30K + 30K + 30K    0.3384   0.041

Table 7 shows the BLEU scores for the experiments. To determine whether our results were significant we used the bootstrap resampling method (Koehn, 2004), which is part of Moses. There were no significant improvements in BLEU score when only one auxiliary language was used. Nonetheless, we observed a significant improvement when data from both Italian and Portuguese was used. This may be an indication that more out-of-domain data, when used in the translation model and sufficiently transformed, can actually improve performance.
One open question at this point is whether the improvement was caused by the contribution of more than one language or simply by the increase in training data. It is possible that a similar improvement could be achieved by increasing the data of one language to 60K. However, in order to support our conjecture, it will be necessary to conduct experiments with different sizes and combinations of data from the related languages.
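The bootstrap resampling test used for the significance figures can be illustrated with the paired variant sketched below. This is a simplified illustration, not the Moses script: the metric is passed in as a function, and a unigram-precision stand-in is used here in place of BLEU so that the example stays self-contained. All names are hypothetical.

```python
import random


def unigram_precision(pairs):
    """Stand-in corpus-level metric (not BLEU): the fraction of hypothesis
    tokens that also occur in the corresponding reference."""
    matched = total = 0
    for hypothesis, reference in pairs:
        reference_tokens = set(reference.split())
        hypothesis_tokens = hypothesis.split()
        matched += sum(1 for tok in hypothesis_tokens if tok in reference_tokens)
        total += len(hypothesis_tokens)
    return matched / total if total else 0.0


def paired_bootstrap(system_a, system_b, references, metric, samples=1000, seed=0):
    """Estimate how often system A fails to beat system B.

    Test sentences are resampled with replacement; both systems are scored
    on each resampled set. The returned fraction of sets on which A does
    not score higher than B serves as an approximate p-value.
    """
    rng = random.Random(seed)
    n = len(references)
    not_better = 0
    for _ in range(samples):
        indices = [rng.randrange(n) for _ in range(n)]
        score_a = metric([(system_a[i], references[i]) for i in indices])
        score_b = metric([(system_b[i], references[i]) for i in indices])
        if score_a <= score_b:
            not_better += 1
    return not_better / samples
```

Resampling whole sentences and recomputing the corpus-level score each time matters for metrics like BLEU, which are not simple averages of per-sentence scores.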

Discussion
We observed that a closely-related language cannot be used to aid in modeling a low-resource language without being properly transformed. Although our naive rule-based transliteration method strongly improved over the non-transliterated closely-related language data, it performed worse than even a small amount of target language data. In addition, adding more data from the related language caused the models to do worse; this may be because there were more words in the data that were not translated using the 200-word dictionary, so there was more noise from the rule-based transliterations in the data. Thus, we were not successful in using data from a related language to improve language modeling for a low-resource language.
For statistical machine translation, our results show gains from augmenting the translation models of a low-resource language with transliterated related-language data. We expect that by taking advantage of more sophisticated transliteration and interpolation methods as well as larger amounts of data from the closely-related language(s), larger improvements in BLEU can be achieved.

Future Work
We plan on experimenting with more sophisticated ways of transforming related language data, including unsupervised and semi-supervised transliteration methods. We would particularly like to experiment with neural network machine transliteration using a character-based LSTM network. This could be developed based on small parallel texts or lists of bilingual cognates of varying sizes. We could also use existing transliteration modules integrated in the SMT system (Durrani et al., 2014). In addition, we hope to explore using bilingual dictionaries without transliteration, as well as using phonological transcription as an intermediary between the two related languages. Finally, it would be beneficial to examine the contribution of each of the rules in our rule-based system separately.
A relatively simple modification to our experiments would be to use more data in creating the translation model (in experiment 3). While we found that using more of the high-resource language data in the language models yielded higher perplexity, this negative effect did not carry over to BLEU scores: we saw a slight improvement in BLEU score when using both Portuguese and Italian data. A similar option would be to select the best Italian and Portuguese data (as was done in experiment 2) for use in the translation model, instead of selecting random sentences.
In statistical machine translation, it would be interesting to explore methods of using data from related languages while preserving the reliable information from the low-resource language. One idea could be methods for interpolating phrase tables for the transliterated corpora as well as setting optimal weights for each of them, similar to the approach of Sennrich (2012). We would also like to improve the translation model coverage by filling up the phrase table for a low-resource language with data from a related language while keeping the useful data from the low-resource language (Bisazza et al., 2011) or by using the related languages as a back-off (Yang and Kirchhoff, 2006).
Finally, a weakness of our language modeling experiments was that we used almost parallel data between the related and the target languages. Hence, the related language was not likely to increase the vocabulary coverage of the models; instead, it just added misspellings of the target language words. In the future, we would like to run experiments with data from the related languages that is strictly distinct from the data of the low-resource language.