Coarse “split and lump” bilingual language models for richer source information in SMT

Darlene Stewart, Roland Kuhn, Eric Joanis, George Foster


Abstract
Recently, there has been interest in automatically generated word classes for improving statistical machine translation (SMT) quality, e.g., (Wuebker et al., 2013). We create new models by replacing words with word classes in features applied during decoding; we call these “coarse models”. We find that coarse versions of the bilingual language models (biLMs) of Niehues et al. (2011) yield larger BLEU gains than the original biLMs. BiLMs provide phrase-based systems with rich contextual information from the source sentence; because they have a large number of types, they suffer from data sparsity. Niehues et al. (2011) mitigated this problem by replacing source or target words with parts of speech (POSs). We vary their approach in two ways: by clustering words on the source or target side over a range of granularities (word clustering), and by clustering the bilingual units that make up biLMs (bitoken clustering). We find that loglinear combinations of the resulting coarse biLMs with each other and with coarse LMs (LMs based on word classes) yield even higher scores than single coarse models. When we add an appealing “generic” coarse configuration chosen on English > French devtest data to four language pairs (keeping the structure fixed, but providing language-pair-specific models for each pair), BLEU gains on blind test data against strong baselines, averaged over 5 runs, are +0.80 for English > French, +0.35 for French > English, +1.0 for Arabic > English, and +0.6 for Chinese > English.
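To make the terminology above concrete, the following is a minimal sketch of how word-based and coarse bitokens might be formed for one sentence pair. It is not the authors' implementation. It assumes that a bitoken pairs each target word with the source words aligned to it; the function names `make_bitokens` and `coarsen`, the `|` and `_` separators, the toy sentence pair, its alignment, and the class labels (`S1`, `T1`, ...) are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def make_bitokens(src, tgt, alignment):
    """Return one bitoken per target word (illustrative format).

    Each bitoken is the target word joined with the source words aligned
    to it. `alignment` is a list of (source_index, target_index) pairs.
    An unaligned target word gets an empty source side.
    """
    aligned = defaultdict(list)
    for s_idx, t_idx in alignment:
        aligned[t_idx].append(s_idx)
    bitokens = []
    for t_idx, t_word in enumerate(tgt):
        src_words = [src[i] for i in sorted(aligned[t_idx])]
        bitokens.append(t_word + "|" + "_".join(src_words))
    return bitokens


def coarsen(words, classes, unk="C?"):
    """Replace each word with its class label (hypothetical mapping)."""
    return [classes.get(w, unk) for w in words]


# Toy English > French example; the alignment is assumed, not computed.
src = ["the", "black", "cat"]
tgt = ["le", "chat", "noir"]
alignment = [(0, 0), (1, 2), (2, 1)]

# Hypothetical word classes; in practice they would come from a
# word-clustering tool run at a chosen granularity.
src_classes = {"the": "S1", "black": "S2", "cat": "S3"}
tgt_classes = {"le": "T1", "chat": "T3", "noir": "T2"}

word_bitokens = make_bitokens(src, tgt, alignment)
coarse_bitokens = make_bitokens(
    coarsen(src, src_classes), coarsen(tgt, tgt_classes), alignment
)

print(word_bitokens)
print(coarse_bitokens)
```

An n-gram model trained over sequences like `word_bitokens` would correspond to a word-based biLM, and one trained over `coarse_bitokens` to a coarse biLM with word clustering on both sides. Clustering only one side, or clustering the bitokens themselves, would change only the mapping step in this sketch.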
Anthology ID:
2014.amta-researchers.3
Volume:
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track
Month:
October 22-26
Year:
2014
Address:
Vancouver, Canada
Editors:
Yaser Al-Onaizan, Michel Simard
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
28–41
URL:
https://aclanthology.org/2014.amta-researchers.3
Cite (ACL):
Darlene Stewart, Roland Kuhn, Eric Joanis, and George Foster. 2014. Coarse “split and lump” bilingual language models for richer source information in SMT. In Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track, pages 28–41, Vancouver, Canada. Association for Machine Translation in the Americas.
Cite (Informal):
Coarse “split and lump” bilingual language models for richer source information in SMT (Stewart et al., AMTA 2014)
PDF:
https://aclanthology.org/2014.amta-researchers.3.pdf