The Design and Construction of a Chinese Sarcasm Dataset

Xiaochang Gong, Qin Zhao, Jun Zhang, Ruibin Mao, Ruifeng Xu


Abstract
As a typical multi-layered, semi-conscious language phenomenon, sarcasm is widespread in social media text, where it is used to enhance emotional expression. The detection and processing of sarcasm is therefore important to social media analysis. However, most existing sarcasm datasets are in English, and an authoritative Chinese sarcasm dataset is still lacking. In this paper, we present the design and construction of the largest high-quality Chinese sarcasm dataset, which contains 2,486 manually annotated sarcastic texts and 89,296 non-sarcastic texts. Furthermore, a balanced dataset is built by carefully sampling the same number of non-sarcastic texts for training sarcasm classifiers. Using the dataset as a benchmark, several sarcasm classification methods are evaluated.
Anthology ID:
2020.lrec-1.619
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
5034–5039
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.619
Cite (ACL):
Xiaochang Gong, Qin Zhao, Jun Zhang, Ruibin Mao, and Ruifeng Xu. 2020. The Design and Construction of a Chinese Sarcasm Dataset. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 5034–5039, Marseille, France. European Language Resources Association.
Cite (Informal):
The Design and Construction of a Chinese Sarcasm Dataset (Gong et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.619.pdf