Kitten: a tool for normalizing HTML and extracting its textual content

Mathieu-Henri Falco; Véronique Moriceau; Anne Vilnat

Abstract

The web is composed of a gigantic number of documents that can be very useful for information extraction systems. Most of them are written in HTML and have to be rendered by an HTML engine in order to display the data they contain on a screen. HTML files thus mix informational and rendering content. Our goal is to design a tool for informational content extraction. A linear extraction with only basic filtering of rendering content would not be enough, as objects such as lists and tables are linearly coded but need to be read in a non-linear way to be correctly interpreted. Besides, these HTML pages are often incorrectly coded from an HTML point of view and use a segmentation of blocks based on blank space that cannot be transposed into a text file without confusing syntactic parsers. For this purpose, we propose the Kitten tool, which first normalizes an HTML file into a Unicode XHTML file, then extracts the informational content into a text file with special processing for sentences, lists and tables.
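As a rough illustration of the kind of extraction the abstract describes, the following minimal sketch pulls text out of loosely coded HTML while giving list items and table rows their own output lines. It is not Kitten's implementation: it relies on Python's standard `html.parser` module, whose lenient parsing stands in for the HTML-to-XHTML normalization step, and the names `TextExtractor` and `extract_text`, the `- ` list prefix and the ` | ` cell separator are all assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags treated as line boundaries, and tags whose content is rendering-only.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "br", "ul", "ol", "table"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Hypothetical extractor: one output line per block, list item or table row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []      # finished output lines
        self.buf = []        # text fragments of the line being built
        self.cells = []      # finished cells of the current table row
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.prefix = ""     # "- " while building a list item

    def _normalized(self):
        # Collapse the blank-space-based layout into single spaces.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        return text

    def _flush(self):
        text = self._normalized()
        if text:
            self.lines.append(self.prefix + text)
            self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "li":
            self._flush()        # also closes a previous unclosed <li>
            self.prefix = "- "
        elif tag == "tr":
            self._flush()
            self.cells = []
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ("td", "th"):
            self.cells.append(self._normalized())
        elif tag == "tr":
            if self.cells:
                self.lines.append(" | ".join(self.cells))
            self.cells = []
        elif tag == "li":
            self._flush()
            self.prefix = ""
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buf.append(data)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()    # forces any buffered trailing text to be delivered
    parser._flush()
    return "\n".join(parser.lines)
```

Calling `extract_text` on a page yields one sentence-like line per paragraph, one line per list item and one line per table row. The sketch ignores nested tables, unclosed table cells and character-encoding detection, all of which a real tool would need to handle.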

Anthology ID: L12-1250
Volume: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month: May
Year: 2012
Address: Istanbul, Turkey
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 2261–2267
URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/467_Paper.pdf
Cite (ACL): Mathieu-Henri Falco, Véronique Moriceau, and Anne Vilnat. 2012. Kitten: a tool for normalizing HTML and extracting its textual content. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2261–2267, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal): Kitten: a tool for normalizing HTML and extracting its textual content (Falco et al., LREC 2012)
PDF: http://www.lrec-conf.org/proceedings/lrec2012/pdf/467_Paper.pdf
