Structural alignment of plain text books

André Santos, José João Almeida, Nuno Carvalho


Abstract
Text alignment is one of the main processes for obtaining parallel corpora. When aligning two versions of a book, results are often affected by unpaired sections: sections which exist in only one of the versions of the book. We developed Text::Perfide::BookSync, a Perl module which performs book synchronization (structural alignment based on section delimitation), provided the books have been previously annotated by Text::Perfide::BookCleaner. We discuss the need for such a tool and several implementation decisions. The main functions are described, and examples of input and output are presented. Text::Perfide::PartialAlign is an extension of the partialAlign.py tool bundled with hunalign which proposes an alternative method for splitting bitexts.
Anthology ID:
L12-1576
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2069–2074
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/967_Paper.pdf
Cite (ACL):
André Santos, José João Almeida, and Nuno Carvalho. 2012. Structural alignment of plain text books. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2069–2074, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Structural alignment of plain text books (Santos et al., LREC 2012)