MTWatch: A Tool for the Analysis of Noisy Parallel Data

Sandipan Dandapat, Declan Groves


Abstract
State-of-the-art statistical machine translation (SMT) techniques require good-quality parallel data to build a translation model. The availability of large parallel corpora has rapidly increased over the past decade. However, these newly developed parallel data often contain significant noise. In this paper, we describe our approach for classifying good-quality parallel sentence pairs from noisy parallel data. We use 10 different features within a Support Vector Machine (SVM)-based model for our classification task. We report a reasonably good classification accuracy and its positive effect on overall MT accuracy.
Anthology ID:
L14-1248
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
41–45
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/272_Paper.pdf
Cite (ACL):
Sandipan Dandapat and Declan Groves. 2014. MTWatch: A Tool for the Analysis of Noisy Parallel Data. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 41–45, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
MTWatch: A Tool for the Analysis of Noisy Parallel Data (Dandapat & Groves, LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/272_Paper.pdf