Detecting Deceptive Opinion Spam using Linguistics, Behavioral and Statistical Modeling



Introduction
With the advent of Web 2.0, consumer reviews have become an important source of public opinion that influences our decisions across a wide spectrum of daily and professional activities: where to eat, where to stay, which products to purchase, which doctors to see, which books to read, which universities to attend, and so on. Positive and negative reviews translate directly into financial gains and losses for companies. This unfortunately gives strong incentives for opinion spamming, which refers to illegal human activities (e.g., writing fake reviews and giving false ratings) that try to mislead customers by promoting or demoting certain entities (e.g., products and businesses). The problem has been widely reported in the news. Despite recent research efforts on detection, the problem is far from solved. Worse, opinion spamming is widespread: while credit card fraud is as rare as 0.2%, based on our research we estimate that up to 30% of the reviews on many Web sites could be fake. Detecting fake reviews and opinions is therefore a pressing and profound issue, as it is critical to ensuring the trustworthiness of information on the Web. If fake reviews go undetected, social media could become a place full of lies, fakes, and deceptions, and thus completely useless.
Major review hosting sites and e-commerce vendors have already made some progress in detecting fake reviews. However, the task remains extremely challenging because it is very difficult to obtain large-scale ground truth samples of deceptive opinions for algorithm development and evaluation, or to conduct large-scale domain expert evaluations. Further, in contrast to other kinds of spamming (e.g., Web and link spam, social/blog spam, email spam), opinion spam has a distinctive flavor, as it involves the fluid sentiments of users and their evaluations. It therefore requires a very different treatment.
Since our first paper on the topic in 2007 (Jindal and Liu, 2007), our group and many other researchers have proposed several algorithms and bridged algorithmic methodologies from various scientific disciplines to solve the problem, including computational linguistics (Ott et al., 2011), social and behavioral sciences (Jindal and Liu, 2008; Mukherjee et al., 2013a, b), and machine learning, data mining, and Bayesian statistics (Mukherjee et al., 2012; Fei et al., 2013; Mukherjee et al., 2013c; Li et al., 2014b; Li et al., 2014a). Apart from mainstream NLP and Web mining, deceptive opinion spam has gained a lot of interest in the communications (Hancock et al., 2008), psycholinguistics (Gokhman et al., 2012), and economic analysis (Wang, 2010) communities, as attested by publications in the top-tier venues of those communities. The problem has far-reaching implications for various allied NLP topics, including Lie Detection, Forensic Linguistics, Opinion Trust and Veracity Verification, and Plagiarism Detection. However, owing to the inherent nature of the problem, it requires a unique blend of NLP, data mining, machine learning, social, behavioral, and statistical techniques, with which many NLP researchers may not be familiar.
In this tutorial, we aim to cover the problem in its full depth and breadth, covering the diverse algorithms that have been developed over the past seven years. The most attractive quality of these techniques is that many of them can be adapted to cross-domain and unsupervised settings. Some of the methods are even in use by startups and established companies. Our focus is on insight and understanding, using illustrations and intuitive deductions. The goal of the tutorial is to make the inner workings of these techniques transparent and intuitive, and their results interpretable.

Content Overview
The first part of the tutorial presents the problem in its various flavors, the relevant NLP techniques, and the algorithms motivated by the social and behavioral sciences. It also gives detailed insight into commercial vs. crowdsourced deceptive opinions using information theory and linguistics. The second part covers the detailed math and algorithms for training supervised, unsupervised, semi-supervised, and partially supervised machine learning and statistical models for deceptive opinion spam detection. These algorithms allow us to work with unlabeled data, which is a key aspect of the problem because generating high-quality labels of fake reviews at large scale is hard, if not impossible. We also discuss some new evaluation methods. Additionally, we draw connections to Authorship Attribution to discover fake reviewers with multiple accounts based on their writing styles, which is a new frontier in deceptive opinion spamming. The last part of the tutorial gives a general overview of the applications of these methods in allied NLP problems and domains, the available data sources, and the limitations of the existing methods.

Instructor Biography
Arjun Mukherjee is an Assistant Professor in the Department of Computer Science at the University of Houston. He is an active researcher in the areas of opinion spam, sentiment analysis, and Web mining. He is the lead author of several influential works on opinion spam, including work on group opinion spam, commercial fake review filters (e.g., Yelp), and various statistical models for detecting singular opinion spammers, burstiness patterns, and campaigns. His work on opinion mining, including deception detection, has also received significant media attention (e.g., ACM Tech News, NYTimes, LATimes, Business Week, CNet, etc.). Mukherjee has also served as a program committee member of WWW, ACL, EMNLP, and IJCNLP.