Arabic-Segmentation Combination Strategies for Statistical Machine Translation

Saab Mansour, Hermann Ney


Abstract
Arabic segmentation has already been applied successfully to statistical machine translation (SMT). Yet there is no consistent comparison of the effect of different techniques and methods on the final translation quality. In this work, we use existing tools and also re-implement and develop new methods for segmentation. We compare the resulting SMT systems based on the different segmentation methods on the small IWSLT 2010 BTEC and the large NIST 2009 Arabic-to-English translation tasks. Our results show that for both small and large training data, segmentation yields strong improvements, but the differences between the top-ranked segmenters are statistically insignificant. Because we apply different methodologies for segmentation, we expect complementary variation in the results achieved by each method. As done in previous work, we combine several segmentation schemes of the same model but achieve only modest improvements. Next, we try a different strategy, where we combine the different segmentation methods rather than the different segmentation schemes. In this case, we achieve stronger improvements over the best single system. Finally, combining both schemes and methods yields a further slight gain over the best combination strategy.
Anthology ID:
L12-1279
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3915–3920
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/509_Paper.pdf
Cite (ACL):
Saab Mansour and Hermann Ney. 2012. Arabic-Segmentation Combination Strategies for Statistical Machine Translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3915–3920, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Arabic-Segmentation Combination Strategies for Statistical Machine Translation (Mansour & Ney, LREC 2012)