RedDust: a Large Reusable Dataset of Reddit User Traits

Anna Tigunova, Paramita Mirza, Andrew Yates, Gerhard Weikum


Abstract
Social media is a rich source of assertions about personal traits, such as “I am a doctor” or “my hobby is playing tennis”. Precisely identifying explicit assertions is difficult, though, because of the users’ highly varied vocabulary and language expressions. Identifying personal traits from implicit assertions like “I’ve been at work treating patients all day” is even more challenging. This paper presents RedDust, a large-scale annotated resource for user profiling that covers over 300k Reddit users across five attributes: profession, hobby, family status, age, and gender. We construct RedDust using a diverse set of high-precision patterns and demonstrate its use as a resource for developing learning models to deal with implicit assertions. RedDust consists of users’ personal traits, which are (attribute, value) pairs, along with users’ post ids, which may be used to retrieve the posts from a publicly available crawl or from the Reddit API. We discuss the construction of the resource and show interesting statistics and insights into the data. We also compare different classifiers that can be learned from RedDust. To the best of our knowledge, RedDust is the first annotated language resource about Reddit users at large scale. We envision further use cases of RedDust for providing background knowledge about user traits, to enhance personalized search and recommendation as well as conversational agents.
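To make the idea of pattern-based labeling concrete, the following is a minimal Python sketch of how explicit assertions might be turned into (attribute, value) pairs. The regular expressions, the `PATTERNS` list, and the `extract_traits` function are hypothetical and chosen only for illustration; they are not the patterns used to build RedDust, and the abstract does not specify them.

```python
import re
from typing import List, Tuple

# Hypothetical high-precision patterns, for illustration only.
# These are NOT the patterns used to construct RedDust.
PATTERNS = [
    ("profession",
     re.compile(r"\bI am an? (doctor|teacher|nurse|engineer)\b", re.IGNORECASE)),
    ("hobby",
     re.compile(r"\bmy hobby is (?:playing )?(\w+)", re.IGNORECASE)),
]


def extract_traits(post: str) -> List[Tuple[str, str]]:
    """Return (attribute, value) pairs for explicit assertions found in a post."""
    traits = []
    for attribute, pattern in PATTERNS:
        for match in pattern.finditer(post):
            traits.append((attribute, match.group(1).lower()))
    return traits


explicit = extract_traits("I am a doctor and my hobby is playing tennis.")
implicit = extract_traits("I've been at work treating patients all day")
print(explicit)
print(implicit)
```

Under these assumed patterns, the explicit post yields a profession pair and a hobby pair, while the implicit post yields nothing. That gap is what the abstract points to when it proposes learned models for implicit assertions.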
Anthology ID:
2020.lrec-1.751
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
6118–6126
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.751
Cite (ACL):
Anna Tigunova, Paramita Mirza, Andrew Yates, and Gerhard Weikum. 2020. RedDust: a Large Reusable Dataset of Reddit User Traits. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 6118–6126, Marseille, France. European Language Resources Association.
Cite (Informal):
RedDust: a Large Reusable Dataset of Reddit User Traits (Tigunova et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.751.pdf