Dong Nguyen


2019

Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets
Nicole Peinelt | Maria Liakata | Dong Nguyen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Existing datasets for scoring text pairs in terms of semantic similarity contain instances whose resolution differs according to the degree of difficulty. This paper proposes to distinguish obvious from non-obvious text pairs based on superficial lexical overlap and ground-truth labels. We characterise existing datasets in terms of containing difficult cases and find that recently proposed models struggle to capture the non-obvious cases of semantic similarity. We describe metrics that emphasise cases of similarity which require more complex inference and propose that these are used for evaluating systems for semantic similarity.

Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings
Philippa Shoemark | Farhana Ferdousi Liza | Dong Nguyen | Scott Hale | Barbara McGillivray
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Word embeddings are increasingly used for the automatic detection of semantic change; yet, a robust evaluation and systematic comparison of the choices involved has been lacking. We propose a new evaluation framework for semantic change detection and find that (i) using the whole time series is preferable over only comparing between the first and last time points; (ii) independently trained and aligned embeddings perform better than continuously trained embeddings for long time periods; and (iii) that the reference point for comparison matters. We also present an analysis of the changes detected on a large Twitter dataset spanning 5.5 years.

Challenges and frontiers in abusive content detection
Bertie Vidgen | Alex Harris | Dong Nguyen | Rebekah Tromble | Scott Hale | Helen Margetts
Proceedings of the Third Workshop on Abusive Language Online

Online abusive content detection is an inherently difficult task. It has received considerable attention from academia, particularly within the computational linguistics community, and performance appears to have improved as the field has matured. However, considerable challenges and unaddressed frontiers remain, spanning technical, social and ethical dimensions. These issues constrain the performance, efficiency and generalizability of abusive content detection systems. In this article we delineate and clarify the main challenges and frontiers in the field, critically evaluate their implications and discuss potential solutions. We also highlight ways in which social scientific insights can advance research. We discuss the lack of support given to researchers working with abusive content and provide guidelines for ethical research.

2018

Comparing Automatic and Human Evaluation of Local Explanations for Text Classification
Dong Nguyen
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Text classification models are becoming increasingly complex and opaque; however, for many applications it is essential that the models are interpretable. Recently, a variety of approaches have been proposed for generating local explanations. While robust evaluations are needed to drive further progress, so far it is unclear which evaluation approaches are suitable. This paper is a first step towards more robust evaluations of local explanations. We evaluate a variety of local explanation approaches using automatic measures based on word deletion. Furthermore, we show that an evaluation using a crowdsourcing experiment correlates moderately with these automatic measures and that a variety of other factors also impact the human judgements.

2017

A Kernel Independence Test for Geographical Language Variation
Dong Nguyen | Jacob Eisenstein
Computational Linguistics, Volume 43, Issue 3 - September 2017

Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: Some approaches apply only to frequencies, others to boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation, which solves both of these problems. Our approach builds on Reproducing Kernel Hilbert Space (RKHS) representations for nonparametric statistics, and takes the form of a test statistic that is computed from pairs of individual geotagged observations without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real data sets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a data set of letters to the editor in North American newspapers. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.

2016

Computational Sociolinguistics: A Survey
Dong Nguyen | A. Seza Doğruöz | Carolyn P. Rosé | Franciska de Jong
Computational Linguistics, Volume 42, Issue 3 - September 2016

Automatic Detection of Intra-Word Code-Switching
Dong Nguyen | Leonie Cornips
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

2015

#SupportTheCause: Identifying Motivations to Participate in Online Health Campaigns
Dong Nguyen | Tijs van den Broek | Claudia Hauff | Djoerd Hiemstra | Michel Ehrenhard
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

On the Impact of Twitter-based Health Campaigns: A Cross-Country Analysis of Movember
Nugroho Dwi Prasetyo | Claudia Hauff | Dong Nguyen | Tijs van den Broek | Djoerd Hiemstra
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

2014

Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment
Dong Nguyen | Dolf Trieschnigg | A. Seza Doğruöz | Rilana Gravel | Mariët Theune | Theo Meder | Franciska de Jong
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

TweetGenie: Development, Evaluation, and Lessons Learned
Dong Nguyen | Dolf Trieschnigg | Theo Meder
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations

Predicting Code-switching in Multilingual Communication for Immigrant Communities
Evangelos Papalexakis | Dong Nguyen | A. Seza Doğruöz
Proceedings of the First Workshop on Computational Approaches to Code Switching

2013

Word Level Language Identification in Online Multilingual Communication
Dong Nguyen | A. Seza Doğruöz
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Learning to Extract Folktale Keywords
Dolf Trieschnigg | Dong Nguyen | Mariët Theune
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

2011

Language use as a reflection of socialization in online communities
Dong Nguyen | Carolyn P. Rosé
Proceedings of the Workshop on Language in Social Media (LSM 2011)

Author Age Prediction from Text using Linear Regression
Dong Nguyen | Noah A. Smith | Carolyn P. Rosé
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities