Michael Paul

Also published as: Michael J. Paul


2019

Neural User Factor Adaptation for Text Classification: Learning to Generalize Across Author Demographics
Xiaolei Huang | Michael J. Paul
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Language use varies across different demographic factors, such as gender, age, and geographic location. However, most existing document classification methods ignore demographic variability. In this study, we examine empirically how text data can vary across four demographic factors: gender, age, country, and region. We propose a multitask neural model to account for demographic variations via adversarial training. In experiments on four English-language social media datasets, we find that classification performance improves when adapting for user factors.

Overview of the Fourth Social Media Mining for Health (SMM4H) Shared Tasks at ACL 2019
Davy Weissenbacher | Abeed Sarker | Arjun Magge | Ashlynn Daughton | Karen O’Connor | Michael J. Paul | Graciela Gonzalez-Hernandez
Proceedings of the Fourth Social Media Mining for Health Applications (#SMM4H) Workshop & Shared Task

The number of users of social media continues to grow, with nearly half of adults worldwide and two-thirds of all American adults using social networking. Advances in automated data processing, machine learning, and NLP present the possibility of utilizing this massive data source for biomedical and public health applications, if researchers address the methodological challenges unique to this medium. We present the Social Media Mining for Health Shared Tasks, co-located with ACL 2019 in Florence, which address these challenges for health monitoring and surveillance, utilizing state-of-the-art techniques for processing noisy, real-world, and substantially creative language expressions from social media users. For the fourth execution of this challenge, we proposed four different tasks. Task 1 asked participants to distinguish tweets reporting an adverse drug reaction (ADR) from those that do not. Task 2, a follow-up to Task 1, asked participants to identify the span of text in tweets reporting ADRs. Task 3 was an end-to-end task in which the goal was to first detect tweets mentioning an ADR and then map the extracted colloquial mentions of ADRs in the tweets to their corresponding standard concept IDs in the MedDRA vocabulary. Finally, Task 4 asked participants to classify whether a tweet contains a personal mention of one’s health, a more general discussion of the health issue, or an unrelated mention. A total of 34 teams from around the world registered, and 19 teams from 12 countries submitted a system run. We summarize here the corpora for this challenge, which are freely available at https://competitions.codalab.org/competitions/22521, and present an overview of the methods and the results of the competing systems.

Evaluating Topic Quality with Posterior Variability
Linzi Xing | Michael J. Paul | Giuseppe Carenini
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Probabilistic topic models such as latent Dirichlet allocation (LDA) are widely used with Bayesian inference methods such as Gibbs sampling to learn posterior distributions over topic model parameters. We derive a novel measure of LDA topic quality using the variability of the posterior distributions. Compared to several existing baselines for automatic topic evaluation, the proposed metric achieves state-of-the-art correlations with human judgments of topic quality in experiments on three corpora. We additionally demonstrate that topic quality estimation can be further improved using a supervised estimator that combines multiple metrics.

Neural Temporality Adaptation for Document Classification: Diachronic Word Embeddings and Domain Adaptation Models
Xiaolei Huang | Michael J. Paul
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Language usage can change across periods of time, but document classifiers are usually trained and tested on corpora spanning multiple years without considering temporal variations. This paper describes two complementary ways to adapt classifiers to shifts across time. First, we show that diachronic word embeddings, which were originally developed to study language change, can also improve document classification, and we show a simple method for constructing this type of embedding. Second, we propose a time-driven neural classification model inspired by methods for domain adaptation. Experiments on six corpora show how these methods can make classifiers more robust over time.

A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity
Yoshinari Fujinuma | Jordan Boyd-Graber | Michael J. Paul
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Cross-lingual word embeddings encode the meaning of words from different languages into a shared low-dimensional space. An important requirement for many downstream tasks is that word similarity should be independent of language; that is, word vectors within one language should not be more similar to each other than to words in another language. We quantify this characteristic using modularity, a network measure of the strength of clusters in a graph. Modularity has a moderate to strong correlation with three downstream tasks, even though modularity is based only on the structure of embeddings and does not require any external resources. We show through experiments that modularity can serve as an intrinsic validation metric to improve unsupervised cross-lingual word embeddings, particularly on distant language pairs in low-resource settings.
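
As a rough illustration of the modularity measurement mentioned in this abstract (not the authors' implementation), the sketch below computes standard Newman modularity for a small undirected graph whose nodes are words grouped by language. The function name `modularity`, the toy words, and the hand-written edges are all assumptions made for illustration; how a graph would be derived from real embeddings is not specified in the abstract and is not attempted here.

```python
from collections import defaultdict


def modularity(edges, labels):
    """Newman modularity of an undirected, unweighted graph.

    edges  -- list of (u, v) pairs, one per undirected edge
    labels -- dict mapping each node to its group (here: a language code)
    """
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")

    within = defaultdict(int)  # edges whose endpoints share a group
    degree = defaultdict(int)  # summed node degree per group

    for u, v in edges:
        degree[labels[u]] += 1
        degree[labels[v]] += 1
        if labels[u] == labels[v]:
            within[labels[u]] += 1

    # Q = sum over groups of (fraction of edges inside the group)
    #     minus (expected fraction under random rewiring).
    return sum(within[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical toy vocabulary: two English and two Spanish words.
lang = {"dog": "en", "cat": "en", "perro": "es", "gato": "es"}

# Neighbours stay within each language: languages form separate clusters.
segregated = [("dog", "cat"), ("perro", "gato")]

# Neighbours cross languages: translations sit next to each other.
mixed = [("dog", "perro"), ("cat", "gato")]

print(modularity(segregated, lang))
print(modularity(mixed, lang))
```

In this toy setting, the language-segregated graph scores higher than the mixed one, which matches the abstract's framing: when words cluster by language rather than by meaning, modularity with respect to the language partition is high.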

Analyzing Bayesian Crosslingual Transfer in Topic Models
Shudong Hao | Michael J. Paul
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We introduce a theoretical analysis of crosslingual transfer in probabilistic topic models. By formulating posterior inference through Gibbs sampling as a process of language transfer, we propose a new measure that quantifies the loss of knowledge across languages during this process. This measure enables us to derive a PAC-Bayesian bound that elucidates the factors affecting model quality, both during training and in downstream applications. We provide experimental validation of the analysis on a diverse set of five languages, and discuss best practices for data collection and model design based on our analysis.

2018

Learning Multilingual Topics from Incomparable Corpora
Shudong Hao | Michael J. Paul
Proceedings of the 27th International Conference on Computational Linguistics

Multilingual topic models enable crosslingual tasks by extracting consistent topics from multilingual corpora. Most models require parallel or comparable training corpora, which limits their ability to generalize. In this paper, we first demystify the knowledge transfer mechanism behind multilingual topic models by defining an alternative but equivalent formulation. Based on this analysis, we then relax the assumption of training data required by most existing models, creating a model that only requires a dictionary for training. Experiments show that our new method effectively learns coherent multilingual topics from partially and fully incomparable corpora with limited amounts of dictionary resources.

Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task
Graciela Gonzalez-Hernandez | Davy Weissenbacher | Abeed Sarker | Michael Paul
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

Overview of the Third Social Media Mining for Health (SMM4H) Shared Tasks at EMNLP 2018
Davy Weissenbacher | Abeed Sarker | Michael J. Paul | Graciela Gonzalez-Hernandez
Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task

The goals of the SMM4H shared tasks are to release annotated, health-related social media datasets to the research community and to compare the performance of natural language processing and machine learning systems on tasks involving these datasets. The third execution of the SMM4H shared tasks, co-hosted with EMNLP-2018, comprised four subtasks. These subtasks involve annotated user posts from Twitter (tweets) and focus on the (i) automatic classification of tweets mentioning a drug name, (ii) automatic classification of tweets containing reports of first-person medication intake, (iii) automatic classification of tweets presenting self-reports of adverse drug reaction (ADR) detection, and (iv) automatic classification of vaccine behavior mentions in tweets. A total of 14 teams participated and 78 system runs were submitted (23 for task 1, 20 for task 2, 18 for task 3, 17 for task 4).

Lessons from the Bible on Modern Topics: Low-Resource Multilingual Topic Model Evaluation
Shudong Hao | Jordan Boyd-Graber | Michael J. Paul
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Multilingual topic models enable document analysis across languages through coherent multilingual summaries of the data. However, there is no standard and effective metric to evaluate the quality of multilingual topics. We introduce a new intrinsic evaluation of multilingual topic models that correlates well with human judgments of multilingual topic coherence as well as performance in downstream applications. Importantly, we also study evaluation for low-resource languages. Because standard metrics fail to accurately measure topic quality when robust external resources are unavailable, we propose an adaptation model that improves the accuracy and reliability of these metrics in low-resource settings.

Examining Temporality in Document Classification
Xiaolei Huang | Michael J. Paul
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Many corpora span broad periods of time. Language processing models trained during one time period may not work well in future time periods, and the best model may depend on specific times of year (e.g., people might describe hotels differently in reviews during the winter versus the summer). This study investigates how document classifiers trained on documents from certain time intervals perform on documents from other time intervals, considering both seasonal intervals (intervals that repeat across years, e.g., winter) and non-seasonal intervals (e.g., specific years). We show experimentally that classification performance varies over time, and that performance can be improved by using a standard domain adaptation approach to adjust for changes in time.

2017

Incorporating Metadata into Content-Based User Embeddings
Linzi Xing | Michael J. Paul
Proceedings of the 3rd Workshop on Noisy User-generated Text

Low-dimensional vector representations of social media users can benefit applications like recommendation systems and user attribute inference. Recent work has shown that user embeddings can be improved by combining different types of information, such as text and network data. We propose a data augmentation method that allows novel feature types to be used within off-the-shelf embedding models. Experimenting with the task of friend recommendation on a dataset of 5,019 Twitter users, we show that our approach can lead to substantial performance gains with the simple addition of network and geographic features.

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Lucia Specia | Matt Post | Michael Paul
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Feature Selection as Causal Inference: Experiments with Text Classification
Michael J. Paul
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

This paper proposes a matching technique for learning causal associations between word features and class labels in document classification. The goal is to identify more meaningful and generalizable features than with only correlational approaches. Experiments with sentiment classification show that the proposed method identifies interpretable word associations with sentiment and improves classification performance in a majority of cases. The proposed feature selection method is particularly effective when applied to out-of-domain data.

2016

Identifying and Categorizing Disaster-Related Tweets
Kevin Stowe | Michael J. Paul | Martha Palmer | Leysia Palen | Kenneth Anderson
Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media

Selecting Syntactic, Non-redundant Segments in Active Learning for Machine Translation
Akiva Miura | Graham Neubig | Michael Paul | Satoshi Nakamura
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

Sprite: Generalizing Topic Models with Structured Priors
Michael J. Paul | Mark Dredze
Transactions of the Association for Computational Linguistics, Volume 3

We introduce Sprite, a family of topic models that incorporates structure into model priors as a function of underlying components. The structured priors can be constrained to model topic hierarchies, factorizations, correlations, and supervision, allowing Sprite to be tailored to particular settings. We demonstrate this flexibility by constructing a Sprite-based model to jointly infer topic hierarchies and author perspective, which we apply to corpora of political debates and online reviews. We show that the model learns intuitive topics, outperforming several other topic models at predictive tasks.

2013

Drug Extraction from the Web: Summarizing Drug Experiences with Multi-Dimensional Topic Models
Michael J. Paul | Mark Dredze
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Separating Fact from Fear: Tracking Flu Infections on Twitter
Alex Lamb | Michael J. Paul | Mark Dredze
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2012

Unsupervised Part-of-Speech Tagging in Noisy and Esoteric Domains With a Syntactic-Semantic Bayesian HMM
William M. Darling | Michael J. Paul | Fei Song
Proceedings of the Workshop on Semantic Analysis in Social Media

The IWSLT 2011 Evaluation Campaign on Automatic Talk Translation
Marcello Federico | Sebastian Stüker | Luisa Bentivogli | Michael Paul | Mauro Cettolo | Teresa Herrmann | Jan Niehues | Giovanni Moretti
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Implicitly Intersecting Weighted Automata using Dual Decomposition
Michael J. Paul | Jason Eisner
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Mixed Membership Markov Models for Unsupervised Conversation Modeling
Michael J. Paul
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Translation Quality Indicators for Pivot-based Statistical MT
Michael Paul | Eiichiro Sumita
Proceedings of 5th International Joint Conference on Natural Language Processing

Dialect Translation: Integrating Bayesian Co-segmentation Models with Pivot-based SMT
Michael Paul | Andrew Finch | Paul R. Dixon | Eiichiro Sumita
Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties

2010

Summarizing Contrastive Viewpoints in Opinionated Text
Michael Paul | ChengXiang Zhai | Roxana Girju
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Integration of Multiple Bilingually-Learned Segmentation Schemes into Statistical Machine Translation
Michael Paul | Andrew Finch | Eiichiro Sumita
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

2009

Topic Modeling of Research Fields: An Interdisciplinary Perspective
Michael Paul | Roxana Girju
Proceedings of the International Conference RANLP-2009

Cross-Cultural Analysis of Blogs and Forums with Mixed-Collection Topic Models
Michael Paul | Roxana Girju
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

NICT@WMT09: Model Adaptation and Transliteration for Spanish-English SMT
Michael Paul | Andrew Finch | Eiichiro Sumita
Proceedings of the Fourth Workshop on Statistical Machine Translation

Mining the Web for Reciprocal Relationships
Michael Paul | Roxana Girju | Chen Li
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)

On the Importance of Pivot Language Selection for Statistical Machine Translation
Michael Paul | Hirofumi Yamamoto | Eiichiro Sumita | Satoshi Nakamura
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers

2008

Multilingual Mobile-Phone Translation Services for World Travelers
Michael Paul | Hideo Okuma | Hirofumi Yamamoto | Eiichiro Sumita | Shigeki Matsuda | Tohru Shimizu | Satoshi Nakamura
Coling 2008: Companion volume: Demonstrations

2006

Exploiting Variant Corpora for Machine Translation
Michael Paul | Eiichiro Sumita
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

2004

Example-based Rescoring of Statistical Machine Translation Output
Michael Paul | Eiichiro Sumita | Seiichi Yamamoto
Proceedings of HLT-NAACL 2004: Short Papers

2003

A corpus-centered approach to spoken language translation
Eiichiro Sumita | Yasuhiro Akiba | Takao Doi | Andrew Finch | Kenji Imamura | Michael Paul | Mitsuo Shimohata | Taro Watanabe
10th Conference of the European Chapter of the Association for Computational Linguistics

2002

Corpus-based Generation of Numeral Classifier using Phrase Alignment
Michael Paul | Eiichiro Sumita | Seiichi Yamamoto
COLING 2002: The 19th International Conference on Computational Linguistics

2001

Integration of Referential Scope Limitations into Japanese Pronoun Resolution
Michael Paul | Eiichiro Sumita
Proceedings of the Second SIGdial Workshop on Discourse and Dialogue

1999

Corpus-Based Anaphora Resolution Towards Antecedent Preference
Michael Paul | Kazuhide Yamamoto | Eiichiro Sumita
Coreference and Its Applications