Overview of Topic-based Chinese Message Polarity Classification in SIGHAN 2015

This paper presents an overview of the Topic-based Chinese Message Polarity Classification task in the SIGHAN 2015 bake-off. Topic-based message polarity classification plays an important role in sentiment analysis, information extraction, event tracking, and other related research areas. The task is designed to evaluate techniques for classifying the polarity of Chinese messages towards a given topic. The task organizers manually constructed 25 topics together with 24,374 corresponding messages, which were annotated to build the training and testing datasets. The evaluation results achieved by the participants provide useful suggestions for future research.


Introduction
Recently, with the popularity of social media such as microblogs, weblogs, and discussion forums, interest in analyzing sentiment and mining opinions in user-generated content has grown rapidly. Much work focuses on identifying the overall polarity of a sentence, paragraph, or document (Wiebe et al., 2005; Hu and Liu, 2004; Pang et al., 2002), without considering the polarity of a message towards a specific topic. To this end, SIGHAN 2015 proposes a Topic-based Chinese Message Polarity Classification (TCMPC) task, which aims at classifying the polarity of Chinese messages towards a given topic.
The task of Topic-based Chinese Message Polarity Classification is motivated by the needs of microblog search, where users attempt to discover popular sentiments on a topic. A similar pilot task has been run in the Chinese Opinion Analysis Evaluation (COAE) since 2008 (Zhao et al., 2008; Xu et al., 2009), which worked at the document level on a blog corpus. Generally speaking, the mainstream techniques in COAE 2008 followed the ideas of information retrieval and adopted two-step approaches that first retrieved the documents relevant to the query, i.e., the topic, and then identified the polarity of the retrieved documents (Xu et al., 2009). As social media have become popular, much research has turned to short texts, e.g., messages. The task of Topic-based Chinese Message Polarity Classification in the SIGHAN 8 bakeoff is designed on the basis of the Sentiment Analysis in Twitter task of the SemEval 2015 workshop (Rosenthal et al., 2015). In this task, the organizers provide a collection of messages corresponding to a given topic, together with restricted sentiment resources that contain a partial list of sentiment words. Participants are required to classify the topical messages as positive, negative, or neutral. The task is similar to COAE 2008 and 2009, but it focuses on sentiment polarity classification in short texts.
In the remainder of this paper, we first describe the task of topic-based message polarity classification. We then describe the process of data collection and annotation. Next, we list and briefly describe the participating systems and their results in our evaluation. Finally, we conclude and review the evaluation for future research.

Task Description
Topic-Based Chinese Message Polarity Classification is motivated by the function of microblog search, where users attempt to discover popular sentiments towards a topic.
The organizers collected messages from Chinese microblog platforms¹ according to the predefined topics. Example 1 gives a sample topic together with its messages.
<Topic> "iphone6" (TopicID 0) </Topic>
Example 1: Sample of input.
The participants are required to classify whether each message expresses positive, negative, or neutral sentiment towards the given topic. For messages expressing both positive and negative sentiment towards the topic, the stronger sentiment should be chosen. The analysis results are reported in the following format: <runID; topicID; evalID; mesID; Polarity>, where
- runID is the team name of each participant;
- topicID identifies the topic (e.g., 0 for the topic in Example 1);
- evalID denotes the different runs of the team;
- mesID is the message ID;
- Polarity is the predicted sentiment polarity towards the topic (1 for positive, -1 for negative, and 0 for neutral).
The first run by team 1 on Example 1 is expected to be returned as follows:
<1; 0; 1; M15113801; 0>
<1; 0; 1; M15113803; 1>
<1; 0; 1; M15113805; -1>
In this task, the participants are required to submit two kinds of results, based on (1) restricted resources for fair comparison, e.g., the sentiment lexicon and corpus; and (2) unrestricted resources. We believe that a freely available, annotated corpus that can be used as a common testbed is needed in order to promote research that will lead to a better understanding of how opinions are expressed in microblogs.
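The result format above can be produced and checked mechanically. The following is a minimal sketch, not the organizers' tooling: the names `Record`, `format_record`, and `parse_record` are hypothetical, and the exact whitespace after each semicolon is assumed from the sample lines.

```python
from dataclasses import dataclass

# Polarity codes from the task description: 1 positive, 0 neutral, -1 negative.
VALID_POLARITIES = {1, 0, -1}


@dataclass(frozen=True)
class Record:
    """One submitted result line (hypothetical container, not an official schema)."""
    run_id: str
    topic_id: str
    eval_id: str
    mes_id: str
    polarity: int


def format_record(rec: Record) -> str:
    """Render a record as <runID; topicID; evalID; mesID; Polarity>."""
    return f"<{rec.run_id}; {rec.topic_id}; {rec.eval_id}; {rec.mes_id}; {rec.polarity}>"


def parse_record(line: str) -> Record:
    """Parse one result line and validate the polarity code."""
    text = line.strip()
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError(f"missing angle brackets: {line!r}")
    fields = [field.strip() for field in text[1:-1].split(";")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    polarity = int(fields[4])
    if polarity not in VALID_POLARITIES:
        raise ValueError(f"polarity must be 1, 0, or -1: {line!r}")
    return Record(fields[0], fields[1], fields[2], fields[3], polarity)
```

For instance, `parse_record("<1; 0; 1; M15113805; -1>")` yields a record with `mes_id` "M15113805" and `polarity` -1, and `format_record` turns it back into the same line.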

Datasets
In this section, we will describe our data collection and annotation.

Data Collection
We first identified popular topics that widely aroused people's comments and sentiments in the newspapers. For this purpose, we applied conventional topic detection techniques to detect hot topics over a three-month period from January 2015 to March 2015. We then manually filtered the results. First, we excluded topics that were incomprehensible, ambiguous, or too general. Second, we removed microblogs that merely mentioned the topic without really being about it, e.g., advertisements.
Given the set of identified topics, we then crawled microblogs involving those topics from the Chinese microblog platforms over the same time period. In total, there were 24,374 messages across 25 topics, and the topics in the test data were different from those in the training data. In practice, most of the collected microblogs tended to fall into the neutral class. To reduce class imbalance, we removed messages without sentiment-bearing words, using NTUSD² as the repository of sentiment words.
¹ http://weibo.com
² http://www.datatang.com/data/44317/

Annotation
Three annotators were trained to annotate the dataset independently. Given a collection of messages, the annotation task was to label each message as positive, negative, or neutral with respect to the given topic. To avoid conflicts, we discarded messages that the three annotators assigned to three different categories.
The Kappa coefficient of agreement was 0.8832 for the positive/negative classification and 0.7829 for the fine-grained annotation, in which the annotator had to choose the stronger sentiment when a message expressed both positive and negative sentiment towards the topic. Statistics of the annotation results are given in Table 1 and Table 2. In the training set, 538 of 4,905 messages (10.97%) are labeled negative and 394 (8.03%) are labeled positive. In the testing set, 3,639 of 19,469 messages (18.69%) are labeled negative and 1,152 (5.91%) are labeled positive.
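The pruning rule can be sketched in a few lines. The function name `resolve_label` is hypothetical, and taking the majority label when two of the three annotators agree is an assumption; the text only states that messages with three different labels were discarded.

```python
from collections import Counter
from typing import Optional, Sequence


def resolve_label(labels: Sequence[int]) -> Optional[int]:
    """Return the label at least two annotators agree on, or None if all three differ.

    Majority voting for 2-vs-1 splits is an assumption, not stated in the paper.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None
```

With labels coded as 1, 0, and -1, `resolve_label((1, 1, 0))` returns 1, while `resolve_label((1, 0, -1))` returns `None`, which marks the message for removal.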
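The paper does not say which Kappa statistic was computed. With three annotators, Fleiss' kappa is one common choice, so the sketch below uses it purely as an assumption; the function name `fleiss_kappa` and the input layout (one list of labels per message) are illustrative.

```python
from typing import List, Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for items that are each rated by the same number of annotators."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # counts[i][j]: number of annotators who assigned category j to item i
    counts: List[List[int]] = [
        [list(item).count(c) for c in categories] for item in ratings
    ]
    # Observed agreement, averaged over items
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined, reported here as 1.0
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)
```

Each inner sequence holds the three annotators' labels for one message, for example `fleiss_kappa([[1, 1, 1], [-1, -1, 0], [0, 0, 0]])`.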

Evaluation Metrics
In the evaluation, the same metrics were applied to both the resource-restricted and the resource-unrestricted runs. Each message was assigned one of three labels: positive, negative, or objective/neutral. We evaluated the systems in terms of precision, recall, and F1 score for predicting positive and negative messages, respectively. We then used the macro-averaged F1 score for system comparison.
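For reference, the standard definitions of these measures for a class c (positive or negative) are sketched below, where TP_c, FP_c, and FN_c denote the true positives, false positives, and false negatives for class c. Averaging over the positive and negative classes only is an assumption, modeled on the SemEval 2015 task on which this task is based.

```latex
\begin{align*}
P_c &= \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F1_c = \frac{2\,P_c\,R_c}{P_c + R_c}, \\
F1_{\text{macro}} &= \frac{F1_{\text{pos}} + F1_{\text{neg}}}{2}
\end{align*}
```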
[...] Table 3. Table 4 shows the testing results of the TCMPC task based on restricted resources, and Table 5 shows the testing results based on unrestricted resources. In addition to precision, recall, and F1, there are other fine-grained performance criteria: precision+ reflects the percentage of correct positive messages among the positive messages submitted by each team, and recall- reflects the percentage of the negative messages in the dataset that each team correctly identified.
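The per-class and macro-averaged scores can be computed as in the sketch below. It is not the official scorer: the function names are hypothetical, labels are coded 1, 0, and -1 as in the task description, and averaging F1 over the positive and negative classes only is an assumption.

```python
from typing import Sequence, Tuple


def prf(gold: Sequence[int], pred: Sequence[int], label: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class; 0.0 when a denominator is zero."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    n_pred = sum(1 for p in pred if p == label)  # messages the system assigned to the class
    n_gold = sum(1 for g in gold if g == label)  # messages of the class in the gold standard
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_f1(gold: Sequence[int], pred: Sequence[int],
             labels: Sequence[int] = (1, -1)) -> float:
    """Unweighted mean of per-class F1 (positive and negative by default, an assumption)."""
    return sum(prf(gold, pred, label)[2] for label in labels) / len(labels)
```

In these terms, precision+ corresponds to `prf(gold, pred, 1)[0]` and recall- to `prf(gold, pred, -1)[1]`.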

Evaluation
In the general evaluation, team TICS-dm achieved promising results with both restricted and unrestricted resources. Their results were about 10% higher than those of the second-ranked team. Teams ZWK, NEUDM1, and NEUDM2 also reached a performance of nearly 75%. In general, most teams performed better with unrestricted resources than with restricted resources.
In the fine-grained evaluation, team TICSdm outperformed the other teams by an even wider margin: their results on positive messages were about 30% higher than those of the second-ranked team with unrestricted resources. Team HLT HITSZ also performed well: their results on positive messages were about 10% higher than those of the third-ranked team with unrestricted resources. Overall, each team performed better on negative messages than on positive messages.

Conclusion
This paper provides an overview of SIGHAN 2015 Bake-off Task 2: Topic-Based Chinese Message Polarity Classification, including task design, data preparation, evaluation metrics, and performance evaluation results. The task requires each participant to submit two kinds of results, one based on restricted resources for fair comparison and one based on unrestricted resources. Regardless of actual performance, all submissions contribute to the common effort to produce an effective Chinese message polarity classifier, and the individual reports in the bake-off proceedings provide useful insight into Chinese language processing. We believe that a freely available, annotated corpus that can be used as a common testbed is needed in order to promote research that will lead to a better understanding of how sentiment is conveyed in microblogs. All datasets with gold standards are publicly available for research purposes.