Raymond Lau

2016

Exploring Topic Discriminating Power of Words in Latent Dirichlet Allocation
Kai Yang | Yi Cai | Zhenhong Chen | Ho-fung Leung | Raymond Lau
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Latent Dirichlet Allocation (LDA) and its variants have been widely used to discover latent topics in textual documents. However, some of the topics generated by LDA may be noisy, with irrelevant words scattered across them. We call such words topic-indiscriminate words; they tend to make topics more ambiguous and less interpretable by humans. In this work, we propose a new topic model named TWLDA, which assigns low weights to words with low topic discriminating power. Our experimental results show that the proposed approach, which effectively reduces the number of topic-indiscriminate words in discovered topics, improves the effectiveness of LDA.

2014

Clustering tweets using Wikipedia concepts
Guoyu Tang | Yunqing Xia | Weizhi Wang | Raymond Lau | Fang Zheng
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Two challenging issues are notable in tweet clustering. First, the sparse data problem is serious since no tweet can be longer than 140 characters. Second, synonymy and polysemy are rather common because users tend to express the same meaning in many different ways in tweets. Inspired by recent research indicating that Wikipedia is promising for representing text, we use Wikipedia concepts to represent tweets as concept vectors. We address the polysemy issue with a Bayesian model, and the synonymy issue by exploiting Wikipedia redirections. To further alleviate the sparse data problem, we make use of three types of out-links in Wikipedia. Evaluation on a Twitter dataset shows that the concept model outperforms the traditional vector space model (VSM) in tweet clustering.