A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging

Shabnam Behzad, Amir Zeldes


Abstract
Part of speech tagging is a fundamental NLP task often regarded as solved for high-resource languages such as English. Current state-of-the-art models achieve high accuracy, especially on the news domain. However, when these models are applied to corpora from other genres, and especially to user-generated data from the Web, performance drops substantially. In this work, we study how a state-of-the-art tagging model trained on different genres performs on Web content from unfiltered Reddit forum discussions. We report results for models trained on different splits of the data and tested on Reddit. Our results show that even small amounts of in-domain data can contribute more than data an order of magnitude larger coming from other Web domains. To make progress on out-of-domain tagging, we also evaluate an ensemble approach that uses the outputs of multiple single-genre taggers as input features to a meta-classifier. We present state-of-the-art performance on tagging Reddit data, together with an error analysis of these models and a typology of their most common error types, broken down by training corpus.
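The ensemble idea in the abstract (single-genre taggers feeding a meta-classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the lookup-table meta-classifier, the function names `train_meta` and `predict_meta`, the three taggers, and the toy tags below are all assumptions made for illustration. Each token is represented only by the tags that the base taggers proposed for it, and the meta-classifier learns which gold tag most often goes with each combination.

```python
from collections import Counter, defaultdict


def train_meta(base_preds, gold):
    """Learn a lookup-table meta-classifier (an illustrative stand-in).

    base_preds: one tuple per token holding the tag proposed by each
                single-genre tagger, always in the same tagger order.
    gold:       the gold tag for each token, same length as base_preds.
    """
    table = defaultdict(Counter)
    for preds, tag in zip(base_preds, gold):
        table[tuple(preds)][tag] += 1
    # For every observed combination, keep the most frequent gold tag.
    return {preds: counts.most_common(1)[0][0] for preds, counts in table.items()}


def predict_meta(model, preds):
    """Tag one token from the base taggers' proposals."""
    preds = tuple(preds)
    if preds in model:
        return model[preds]
    # Unseen combination: fall back to a plain majority vote.
    return Counter(preds).most_common(1)[0][0]


# Hypothetical toy data: three taggers trained on different genres.
# Two of them mislabel an interjection as a noun (NN instead of UH).
train_preds = [("NN", "UH", "NN"), ("NN", "UH", "NN"), ("DT", "DT", "DT")]
train_gold = ["UH", "UH", "DT"]

model = train_meta(train_preds, train_gold)
seen = predict_meta(model, ("NN", "UH", "NN"))
unseen = predict_meta(model, ("VB", "VB", "NN"))
print(seen, unseen)
```

On the seen combination the meta-classifier overrides the two taggers that agree on the wrong tag, which a simple majority vote could not do; on an unseen combination it falls back to voting. A practical system would typically use a richer learner and additional features in place of the lookup table.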
Anthology ID:
2020.wac-1.7
Volume:
Proceedings of the 12th Web as Corpus Workshop
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Adrien Barbaresi, Felix Bildhauer, Roland Schäfer, Egon Stemle
Venue:
WAC
Publisher:
European Language Resources Association
Pages:
50–56
Language:
English
URL:
https://aclanthology.org/2020.wac-1.7
Cite (ACL):
Shabnam Behzad and Amir Zeldes. 2020. A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging. In Proceedings of the 12th Web as Corpus Workshop, pages 50–56, Marseille, France. European Language Resources Association.
Cite (Informal):
A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging (Behzad & Zeldes, WAC 2020)
PDF:
https://aclanthology.org/2020.wac-1.7.pdf
Code:
shabnam-b/reddit-pos-ensemble
Data:
English Web Treebank, GUM