Two Database Resources for Processing Social Media English Text

Eleanor Clark, Kenji Araki


Abstract
This research focuses on text processing for English-language social media. We introduce two database resources. The first, the CECS (Casual English Conversion System) database, is a lexicon-type resource of 1,255 entries constructed for use in our experimental system for the automated normalization of casual, irregularly formed English used in communications such as Twitter. Our rule-based approach primarily aims to avoid the problems that user creativity and individuality of language cause when Twitter-style text is used as input to Machine Translation, and to aid comprehension for non-native speakers of English. Although the database is still under development, we have so far carried out two evaluation experiments using our system, which have shown positive results. The second, the CEGS (Casual English Generation System) phoneme database, contains sets of alternative spellings for the phonemes in the CMU Pronouncing Dictionary and is designed for use in a system that generates phoneme-based casual English text from regular English input, in other words, one that automatically produces humanlike creative sentences as an AI task. This paper provides an overview of the necessity, method, application and evaluation of both resources.
Anthology ID:
L12-1124
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
3790–3793
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/288_Paper.pdf
Cite (ACL):
Eleanor Clark and Kenji Araki. 2012. Two Database Resources for Processing Social Media English Text. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3790–3793, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Two Database Resources for Processing Social Media English Text (Clark & Araki, LREC 2012)