Shimei Pan


2022

Incorporating LIWC in Neural Networks to Improve Human Trait and Behavior Analysis in Low Resource Scenarios
Isil Yakut Kilic | Shimei Pan
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Psycholinguistic knowledge resources have been widely used in constructing features for text-based human trait and behavior analysis. Recently, deep neural network (NN)-based text analysis methods have gained dominance due to their high prediction performance. However, NN-based methods may not perform well in low resource scenarios where the ground truth data is limited (e.g., only a few hundred labeled training instances are available). In this research, we investigate diverse methods to incorporate Linguistic Inquiry and Word Count (LIWC), a widely used psycholinguistic lexicon, in NN models to improve human trait and behavior analysis in low resource scenarios. We evaluate the proposed methods in two tasks: predicting delay discounting and predicting drug use based on social media posts. The results demonstrate that our methods perform significantly better than baselines that use only LIWC or only NN-based feature learning methods. They also perform significantly better than published results on the same dataset.

2021

Incorporating medical knowledge in BERT for clinical relation extraction
Arpita Roy | Shimei Pan
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In recent years, pre-trained language models (PLMs) such as BERT have proven to be very effective in diverse NLP tasks such as Information Extraction, Sentiment Analysis and Question Answering. Trained with massive general-domain text, these pre-trained language models capture rich syntactic, semantic and discourse information in the text. However, due to the differences between general and specific domain text (e.g., Wikipedia versus clinical notes), these models may not be ideal for domain-specific tasks (e.g., extracting clinical relations). Furthermore, understanding clinical text properly may require additional medical knowledge. To address these issues, we conduct a comprehensive examination of different techniques for adding medical knowledge to a pre-trained BERT model for clinical relation extraction. Our best model outperforms the state-of-the-art systems on the benchmark i2b2/VA 2010 clinical relation extraction dataset.

2019

Predicting Malware Attributes from Cybersecurity Texts
Arpita Roy | Youngja Park | Shimei Pan
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Text analytics is a useful tool for studying malware behavior and tracking emerging threats. The task of automated malware attribute identification based on cybersecurity texts is very challenging due to the large number of malware attribute labels and the small number of training instances. In this paper, we propose a novel feature learning method to leverage diverse knowledge sources such as a small amount of human annotation, unlabeled text and specifications of malware attribute labels. Our evaluation demonstrates the effectiveness of our method over the state-of-the-art malware attribute prediction systems.

Supervising Unsupervised Open Information Extraction Models
Arpita Roy | Youngja Park | Taesung Lee | Shimei Pan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We propose a novel supervised open information extraction (Open IE) framework that leverages an ensemble of unsupervised Open IE systems and a small amount of labeled data to improve system performance. As input features, it uses the outputs of multiple unsupervised Open IE systems plus a diverse set of lexical and syntactic information such as word embedding, part-of-speech embedding, syntactic role embedding and dependency structure. It produces a sequence of word labels indicating whether each word belongs to a relation, belongs to the arguments of the relation, or is irrelevant. Compared with existing supervised Open IE systems, our approach leverages the knowledge in existing unsupervised Open IE systems to overcome the problem of insufficient training data. By employing multiple unsupervised Open IE systems, our system learns to combine the strengths and avoid the weaknesses of each individual Open IE system. We have conducted experiments on multiple labeled benchmark data sets. Our evaluation results demonstrate the superiority of the proposed method over existing supervised and unsupervised models by a significant margin.

2018

UMBC at SemEval-2018 Task 8: Understanding Text about Malware
Ankur Padia | Arpita Roy | Taneeya Satyapanich | Francis Ferraro | Shimei Pan | Youngja Park | Anupam Joshi | Tim Finin
Proceedings of the 12th International Workshop on Semantic Evaluation

We describe the systems developed by the UMBC team for 2018 SemEval Task 8, SecureNLP (Semantic Extraction from CybersecUrity REports using Natural Language Processing). We participated in three of the sub-tasks: (1) classifying sentences as being relevant or irrelevant to malware, (2) predicting token labels for sentences, and (4) predicting attribute labels from the Malware Attribute Enumeration and Characterization vocabulary for defining malware characteristics. We achieve F1 scores of 50.34/18.0 (dev/test), 22.23 (test-data), and 31.98 (test-data) for Subtask 1, Subtask 2 and Subtask 4, respectively. We also make our cybersecurity embeddings publicly available at http://bit.ly/cyber2vec.

2017

Multi-View Unsupervised User Feature Embedding for Social Media-based Substance Use Prediction
Tao Ding | Warren K. Bickel | Shimei Pan
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper, we demonstrate how state-of-the-art machine learning and text mining techniques can be used to build effective social media-based substance use detection systems. Since substance use ground truth is difficult to obtain on a large scale, to maximize system performance, we explore different unsupervised feature learning methods to take advantage of a large amount of unlabeled social media data. We also demonstrate the benefit of using multi-view unsupervised feature learning to combine heterogeneous user information such as Facebook “likes” and “status updates” to enhance system performance. Based on our evaluation, our best models achieved 86% AUC for predicting tobacco use, 81% for alcohol use and 84% for illicit drug use, all of which significantly outperformed existing methods. Our investigation has also uncovered interesting relations between a user’s social media behavior (e.g., word usage) and substance use.

2016

Personalized Emphasis Framing for Persuasive Message Generation
Tao Ding | Shimei Pan
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

Using Personal Traits For Brand Preference Prediction
Chao Yang | Shimei Pan | Jalal Mahmud | Huahai Yang | Padmini Srinivasan
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Active Learning with Constrained Topic Model
Yi Yang | Shimei Pan | Doug Downey | Kunpeng Zhang
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces

2005

Instance-based Sentence Boundary Determination by Optimization for Natural Language Generation
Shimei Pan | James Shaw
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

2002

Designing a Speech Corpus for Instance-based Spoken Language Generation
Shimei Pan | Wubin Weng
Proceedings of the International Natural Language Generation Conference

2000

Modeling Local Context for Pitch Accent Prediction
Shimei Pan | Julia Hirschberg
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

1999

Word Informativeness and Automatic Pitch Accent Modeling
Shimei Pan | Kathleen R. McKeown
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

1998

Evaluating Response Strategies in a Web-Based Spoken Dialogue Agent
Diane J. Litman | Shimei Pan | Marilyn A. Walker
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

Learning Intonation Rules for Concept to Speech Generation
Shimei Pan | Kathleen McKeown
36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 2

Evaluating Response Strategies in a Web-Based Spoken Dialogue Agent
Diane J. Litman | Shimei Pan | Marilyn A. Walker
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

Learning Intonation Rules for Concept to Speech Generation
Shimei Pan | Kathleen McKeown
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

1997

Language Generation for Multimedia Healthcare Briefings
Kathleen R. McKeown | Desmond A. Jordan | Shimei Pan | James Shaw | Barry A. Allen
Fifth Conference on Applied Natural Language Processing

Integrating Language Generation with Speech Synthesis in a Concept to Speech System
Shimei Pan | Kathleen R. McKeown
Concept to Speech Generation Systems

1992

Knowledge Acquisition and Chinese Parsing Based on Corpus
Chunfa Yuan | Changning Huang | Shimei Pan
COLING 1992 Volume 4: The 14th International Conference on Computational Linguistics