SODA: Service Oriented Domain Adaptation Architecture for Microblog Categorization

We demonstrate SODA (Service Oriented Domain Adaptation) for efficient and scalable cross-domain microblog categorization, which works on the principle of transfer learning. It is built on a novel similarity-based iterative domain adaptation algorithm and extended with features such as active learning and an interactive GUI for use by business professionals. SODA demonstrates efficient classification accuracy on new collections while minimizing, and sometimes eliminating, the need for expensive data labeling efforts. SODA also implements an active learning (AL) technique to select informative instances from the new collection for annotation when a small amount of labeled data is required by the adaptation algorithm.


Introduction
Online social media, such as Twitter.com, have become the de facto standard for sharing information, thoughts, ideas, personal feelings, and daily happenings, which has led research and development in the field of social media analytics to flourish. Social media analytics provide actionable insights to business by analyzing huge amounts of user generated content (UGC) (Sriram et al., 2010; Jo and Oh, 2011; He et al., 2012; Si et al., 2013; Nakov et al., 2013). Sentiment categorization, one of the common social media analytics tasks, segregates a collection of UGC into different buckets with positive, negative, or neutral orientation (Liu and Zhang, 2012; Thelwall et al., 2011; Bollen et al., 2009). This information is used to aggregate statistics and identify trends which are helpful for many applications, viz. customer care, product marketing, and user studies.

* Work done at Xerox Research Centre India
Supervised machine learning (ML) techniques such as text categorization have played a key enabling role in classifying microblogs into sentiment categories (Pang and Lee, 2008; Tan et al., 2009; Go et al., 2009; Fernández et al., 2014). These are trained on a fraction of annotated data as per a client-provided label set, e.g. {positive, negative, neutral}, for a product/service/domain 1 . One of the obstacles to rapid adoption of such systems is the requirement of labeled tweets for developing ML-based models, as it demands extensive human labeling effort. Additionally, the need for manual labeling slows down categorization on high-velocity social media, which requires fast analytic insights. From our conversations with business professionals, we derived the need for a practical solution that would help them scale up across hundreds of collections and domains without the overhead of annotating data and building models from scratch every time for a new collection.
In this paper, we demonstrate Service Oriented Domain Adaptation (SODA), which offers social media analytics as-a-service to users. Specifically, it provides sentiment categorization as-a-service that allows users to efficiently analyze comments from any new collection without the overhead of manual annotations or re-training models. It thus enables faster wide-scale analysis within and across different domains/industries such as telecom, healthcare, and finance. SODA is based on an iterative ensemble-based adaptation technique (Bhatt et al., 2015) which gradually transfers knowledge from the source to the new target collection while being cognizant of the similarity between the two collections. It has been extensively evaluated by business professionals in a user trial and on a benchmark dataset. Figure 1 illustrates the architecture of SODA, comprising three primary modules: 1) similarity, 2) domain adaptation, and 3) active learning. The first two modules use unlabeled data from the new collection, while the optional third module helps in creating labeled data for enhanced classification performance. These modules are explained below.

Similarity
In social media analytics, especially for sentiment categorization, there exist numerous collections about different products or services for which labeled data is available and can thus be used to adapt to a new unlabeled collection. Given a target collection, the key question is to identify the best possible source collection to adapt from. The similarity module in SODA identifies the best adaptable source collection based on the similarity between the source and target collections. This is based on observations from the existing literature (Bhatt et al., 2015; Blitzer et al., 2007) which suggest that if the source and target collections are similar, the adaptation performance tends to be better than if the two collections are dissimilar. The similarity module in SODA is capable of computing different kinds of lexical, syntactic, and semantic similarities between the unlabeled target and labeled source collections. For this demonstration on sentiment categorization of social media data, it measures the cosine similarity between the comments in each collection and computes sim as the similarity score.
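The collection-level cosine similarity can be sketched in a few lines of Python. This is our own minimal illustration assuming a simple bag-of-words term-frequency representation; the paper does not specify SODA's exact vectorization, and the function names are ours:

```python
import math
from collections import Counter

def tf_vector(corpus):
    """Bag-of-words term frequencies for a whole collection of comments."""
    return Counter(w for comment in corpus for w in comment.lower().split())

def collection_similarity(source_corpus, target_corpus):
    """Cosine similarity (sim) between two collections' term-frequency vectors."""
    a, b = tf_vector(source_corpus), tf_vector(target_corpus)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

A score near 1 flags a source collection as a promising candidate to adapt from; a score near 0 suggests the two collections share little vocabulary.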

Domain Adaptation
The heart of SODA is the adaptation module, which works on two principles: generalization and adaptation. During generalization, it learns a shared common representation (Blitzer et al., 2007; Ji et al., 2011; Pan et al., 2010) which minimizes the divergence between the two collections. We leverage the widely used structural correspondence learning (SCL) approach (Blitzer et al., 2007) to compute shared representations. The idea is that a model learned on the shared feature representation using labeled data from the source collection will also generalize well on the target collection. Towards this, we learn a model (C S ) on the shared feature representation from the source collection, referred to as the "source classifier". C S is then used to predict labels for the pool of unlabeled instances from the target collection, referred to as P u , using the shared representations. All instances in P u which are predicted with a confidence (α 1 ) higher than a predefined threshold (θ 1 ) are moved to the pool of pseudo-labeled target instances, referred to as P s . We then learn a target domain model C T on P s using the target-specific representation, referred to as the "target classifier".
C T captures a different view of the target instances than the shared representation and hence brings in discriminating target-specific information which is useful for categorization in the target collection. For further adaptation, the source (C S ) and target (C T ) classifiers are combined in a weighted ensemble (E) with w s and w t as the corresponding weights, which iterates over the remaining unlabeled instances in P u . In each iteration, the ensemble processes the remaining instances and adds confidently predicted instances to P s , which are used to re-train/update C T . This iterative process continues till all instances in P u are confidently labeled or a maximum number of iterations is reached. Transfer occurs within the ensemble, where the source classifier progressively facilitates the learning of the target classifier. The weights of the individual classifiers are updated as a function of error (I(·)) and the similarity (sim) between the collections, which gradually shifts the emphasis from the source to the target classifier. Finally, the ensemble is used to predict labels for future unseen instances in the target collection. Algorithm 1 summarizes our approach (refer to (Bhatt et al., 2015) for more details).
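The loop above can be sketched as a small, self-contained Python program. The classifier interface (a callable returning a label and a confidence), the toy one-dimensional instances, and the geometric weight-update rule are all illustrative simplifications of our own, not the actual Algorithm 1 of Bhatt et al. (2015):

```python
def adapt(source_clf, train_target_clf, P_u, theta, sim, max_iters=20):
    """Illustrative sketch of the iterative ensemble adaptation loop.

    Classifiers are callables x -> (label, confidence in [0, 1]).
    """
    # Step 1: the source classifier C_S pseudo-labels confident target instances.
    P_s, remaining = [], []
    for x in P_u:
        label, conf = source_clf(x)
        if conf >= theta:
            P_s.append((x, label))
        else:
            remaining.append(x)
    target_clf = train_target_clf(P_s)  # C_T on the target-specific view
    w_s, w_t = sim, 1.0 - sim           # ensemble weights (illustrative init)
    for _ in range(max_iters):
        if not remaining:
            break
        newly, still = [], []
        for x in remaining:
            l_s, c_s = source_clf(x)
            l_t, c_t = target_clf(x)
            if l_s == l_t:              # weighted ensemble of the two views
                label, conf = l_s, w_s * c_s + w_t * c_t
            elif w_s * c_s >= w_t * c_t:
                label, conf = l_s, w_s * c_s
            else:
                label, conf = l_t, w_t * c_t
            if conf >= theta:
                newly.append((x, label))
            else:
                still.append(x)
        if newly:
            P_s.extend(newly)
            target_clf = train_target_clf(P_s)  # re-train/update C_T
        # Simplified update: gradually shift emphasis from source to target.
        w_s = w_s * sim
        w_t = 1.0 - w_s
        remaining = still
    return P_s, remaining


# Toy demo: instances are numbers, the true label is their sign.
def source_clf(x):
    return ("pos" if x > 0 else "neg", min(1.0, abs(x)))

def train_target_clf(pairs):
    pos = [x for x, l in pairs if l == "pos"]
    neg = [x for x, l in pairs if l == "neg"]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2 if pos and neg else 0.0
    return lambda x: ("pos" if x > mid else "neg", min(1.0, 2 * abs(x - mid)))

P_s, remaining = adapt(source_clf, train_target_clf,
                       P_u=[2.0, -2.0, 0.5, -0.5], theta=0.6, sim=0.9)
```

In the demo, the source classifier confidently labels only the "easy" instances (±2.0); the harder ones (±0.5) are picked up in later iterations once the ensemble weight has shifted toward the more confident target classifier.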

Active Learning
SODA also implements an active learning module to allow users to annotate a few selected informative comments from the target collection. These comments are selected using cross-entropy difference (CED) (Axelrod et al., 2011) such that the difference with the source collection and the similarity with the target collection are maximized. It selects comment(s) from the target collection that have a low CED score, i.e. comments that have high entropy with respect to the source collection H S (·) and low entropy with respect to the target collection H T (·), as in Equation (1):

CED(x) = H T (x) - H S (x)    (1)
Note that this active learning module is optional and should be used when the adaptation performance with unlabeled instances alone is not satisfactory. More instances can be annotated in multiple rounds until satisfactory performance is achieved. These annotated instances are used to build a stronger target classifier for the ensemble-based adaptation algorithm.
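CED-based selection can be illustrated with simple unigram language models. This is our own minimal sketch in the spirit of Axelrod et al. (2011), not SODA's actual implementation; the add-one smoothing, the whitespace tokenization, and the function names are our assumptions:

```python
import math
from collections import Counter

def unigram_lm(corpus, vocab):
    """Add-one-smoothed unigram language model over a shared vocabulary."""
    counts = Counter(w for comment in corpus for w in comment.split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def cross_entropy(comment, lm):
    """Per-word cross-entropy of a comment under a unigram LM."""
    words = comment.split()
    return -sum(math.log(lm[w]) for w in words) / len(words)

def select_informative(candidates, source_corpus, target_corpus, k=1):
    """Pick the k candidates with the lowest CED(x) = H_T(x) - H_S(x)."""
    vocab = {w for c in candidates + source_corpus + target_corpus
             for w in c.split()}
    lm_s = unigram_lm(source_corpus, vocab)
    lm_t = unigram_lm(target_corpus, vocab)
    scored = sorted(candidates,
                    key=lambda c: cross_entropy(c, lm_t) - cross_entropy(c, lm_s))
    return scored[:k]

source = ["battery life is great", "great battery"]
target = ["network coverage is poor", "poor network"]
picked = select_informative(["battery life great", "network coverage poor"],
                            source, target)
```

The comment that looks most like the target collection and least like the source collection scores the lowest CED, so it is the one shown to the user for annotation.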

Design and Internals
Figure 2(a) illustrates the interactive user interface (UI) of SODA, where one can select a new target collection for the analysis task (i.e. sentiment categorization). For a new target collection, it identifies relevant adaptable source collections based on their similarity. One can select any of the candidate source collections (the selected collection is highlighted in Figure 2(a)) and adapt. Figure 2(b) shows the performance report along with the predicted comments from the target collection. The user evaluates the adaptation performance on the unlabeled target collection by analyzing the predicted comments and decides whether to annotate additional comments in the target collection. If yes, Figure 2(c) lists a few informative comments selected using the active learning module to seek annotations. One can mark these comments as positive, negative, or neutral and subsequently adapt using these labeled instances from the target collection. Figure 2(b) also shows the adaptation performance with a few labeled instances in the target collection. One can continue annotating more instances in the target collection until satisfactory performance is achieved. For a more detailed demonstration, please refer to the video. 2

Figure 3: The effect of labeled comments on the performance while adapting from Coll-1 → Coll-6.

The interactive UI of SODA is developed using the Ruby on Rails framework. All collections are managed in a MySQL server. All three modules in SODA fetch data from the server and write the output back to the server. All modules work in real time, enabling the system to be highly responsive to the user. The application is hosted on Amazon AWS as RESTful web services using Java Jersey (Tomcat server), which acts as a bridge between the UI and the back end.

User Trial & Experimental Results
To evaluate the overall experience, a user trial was conducted where several business professionals provided feedback on SODA. The objective was to evaluate the overall usability, the reduction in required effort, and the performance on new target collections. The overall evaluation rated SODA 5 on usability and 4 on reduction in effort (1 being the worst and 5 the best). Table 1 reports the classification accuracy of SODA with a few labeled comments from the target collection (ranging from 0 to 100). It also reports the performance of the in-domain classifier, which is trained and tested on data from the same collection. Coll-1 to Coll-8 refer to collections pertaining to marketing & sales, Comcast support, DirecTV support, ASUS, Johnson & Johnson CSAT, Apple iPhone6, and HUAWEI, respectively. Figure 3 compares the effect of adding labeled comments in batches of 25 comments at a time. When there is no labeled data in the target collection, the in-domain classifier cannot be applied, while SODA still yields good classification accuracy. Moreover, SODA consistently performs better than the in-domain classifier with the same amount of labeled data. We also evaluated the performance of the domain adaptation (DA) module of SODA on the Amazon review dataset (Blitzer et al., 2007), which is a benchmark dataset for sentiment categorization. It has 4 domains, namely books (B), dvds (D), electronics (E), and kitchen (K), each with 2000 reviews divided equally into positive and negative reviews. Table 2 shows that the DA module of SODA outperforms 1) a widely used domain adaptation technique, namely structural correspondence learning (SCL) (Blitzer et al., 2007; Blitzer et al., 2006), 2) the baseline (BL) where a classifier trained on one domain is applied to another domain, and 3) the in-domain classifier. Note that in Table 2, the performance of the DA module of SODA is reported when it does not use any labeled instances from the target domain.

2 https://www.youtube.com/watch?v=zKnP5QEHVAE

Conclusion
We demonstrated SODA for efficient microblog categorization on new social media collections with minimal (and sometimes no) need for manual annotation, thus enabling faster and more efficient analytics.