Ranking Convolutional Recurrent Neural Networks for Purchase Stage Identification on Imbalanced Twitter Data

Users often use social media to share their interest in products. We propose to identify purchase stages from Twitter data following the AIDA model (Awareness, Interest, Desire, Action). In particular, we define the task of classifying the purchase stage of each tweet in a user’s tweet sequence. We introduce RCRNN, a Ranking Convolutional Recurrent Neural Network which computes tweet representations using convolution over word embeddings and models a tweet sequence with gated recurrent units. Also, we consider various methods to cope with the imbalanced label distribution in our data and show that a ranking layer outperforms class weights.


Introduction
As the use of social media grows, more users are sharing interests or experiences with products, and asking friends for information (Morris et al., 2010). Thus, social media posts can contain information useful for marketing and customer relationship management, including user behavior, opinions, and purchase interest.
In this paper, we present a ranking-based, deep learning approach to automatically identify stages in a sales process following the well-known AIDA (Awareness/Attention, Interest, Desire, and Action) model (Lewis, 1903; Dukesmith, 1904; Russell, 1921). Since we are interested in purchases, we define "Action" as buying a product. Knowledge of a user's purchase stage can help to personalize the type of advertisement a user is shown, e.g., while a user with interest may be shown information about product features by a manufacturer, a user with the desire to purchase may be given coupons for a particular store offering the product of interest. In addition to automatically recognizing the traditional AIDA stages, we also add a class with negative sentiment, namely unhappiness of a user with a product. Given a user's tweet sequence, we define the purchase stage identification task as automatically determining for each tweet whether the user expresses interest in, wants to buy, or has recently bought a product, etc. Table 1 shows one randomly picked example for each of the purchase stages as well as for an artificial class 'N' which we use for tweets not expressing a purchase stage.
* The work was performed during an internship at FX Palo Alto Laboratory.
We introduce RCRNN (ranking convolutional recurrent neural network), a hierarchical neural network that uses convolution to create a tweet representation and recurrent hidden layers to represent a tweet sequence. We compare RCRNN with other possible neural network (NN) architectures and non-neural models.
A particular challenge of our dataset is class imbalance: There are many more tweets expressing none of the purchase stages than tweets expressing one of them. We investigate the use of a ranking layer in our NN and compare it against class weights for handling imbalanced data.
To sum up, our contributions are as follows: (1) We define the new task of purchase stage identification from tweets. Our results show that tweets do contain signals indicative of purchase stages. (2) We propose RCRNN, a hierarchical deep learning model to represent tweets and tweet sequences.
(3) We show that a ranking layer approach outperforms commonly used class weights for training neural networks on imbalanced data.

Related Work
An increasing amount of research is focused on social media with various classification goals. For example, Twitter tweets have been used for the prediction of movie revenues (Asur and Huberman, 2010) and stock prices (Kharratzadeh and Coates, 2012; Bollen and Mao, 2011). Lassen et al. (2014) predicted quarterly iPhone sales motivated by the AIDA model, but did not model AIDA directly as we do in this paper. More related to our task is classifying whether a user has purchase intent. Vieira (2015) and Lo et al. (2016) used features from e-commerce or content discovery platforms to predict buying intentions. Manually crafted linguistic and/or statistical features have been used to predict potential purchase intent from Quora and Yahoo! Answers (Gupta et al., 2014), and to detect purchase intent in product reviews (Ramanand et al., 2010). The task of identifying purchase intent is related to our task of identifying purchase stages, but does not indicate a stage in making a purchase decision. The posts in both Quora and Yahoo! Answers, by their nature, tend to be posts by people seeking information, of which some are related to purchase decisions. The product reviews in Ramanand et al. (2010), in turn, are more targeted towards the product being reviewed. All three tend to be less noisy than a user's tweets due in part to a smaller proportion of tangential text, such as "My brother hid my phone".
Works which use Twitter tweets as input largely employ manually crafted linguistic and statistical features. Hollerit et al. (2013) trained different classifiers on the words and part-of-speech tags of tweets to detect whether a tweet contained "commercial intent", which includes intent to buy or sell. Mahmud et al. (2016) also used manually crafted features to infer potential purchase or recommendation intentions from Twitter.
Recently, convolutional and recurrent neural networks (CNN, RNN) have proven to be effective for different text processing tasks, e.g., (Kalchbrenner et al., 2014; Kim, 2014; Bahdanau et al., 2015; Cho et al., 2014; Hermann et al., 2015). They learn features automatically. Ding et al. (2015) applied a CNN to identify consumption intention from a single tweet. Korpusik et al. (2016) employed a simple average of word embeddings to model tweets and used a long short-term memory network for purchase prediction based on a user's tweet sequence. Both Ding et al. and Korpusik et al. focused on a binary classification task, rather than the finer-grained multi-class AIDA purchase stages our models identify. Both works also used a relatively balanced dataset, thus avoiding the difficult but more realistic classification task on strongly imbalanced data.

Purchase Stage Classification
Following the AIDA model (Lewis, 1903; Dukesmith, 1904; Russell, 1921), we consider the following purchase stages: Awareness (A), Interest (I), Desire (D) and Action ('bought' action in our case, thus we use the abbreviation B). In addition, we include a class with a negative sentiment: Unhappiness (U). We use this class for any expression of unhappiness with a product, before or after buying it. Table 1 provides examples for the different purchase stages. Although it is possible that a user may express unhappiness and an AIDA stage simultaneously, this occurred in only 15 tweets out of over 100k total. The task we focus on in this paper is purchase stage classification, i.e., distinguishing the different purchase stages for individual tweets in a given tweet sequence.

Dataset Creation
Data Collection. For a dataset, we focus on public Twitter tweets. Twitter data for purchase prediction was also collected by Korpusik et al. (2016). They used hand-crafted regular expressions to identify tweets indicating that a user may have bought or wanted a product. However, their dataset was biased towards bought/want tweets and their patterns covered only a subset of possible bought/want phrases.
To create a more "real-world" set, we scraped web sites for mobile phones, tablets and watches available in 2016, collecting 98 model names. The full product names and relatively distinct model names (e.g., 'iPad' but not 'one' as in HTC One) formed queries to the Twitter search API. The tweets were filtered for spam using the URL features from Benevenuto et al. (2010) and spam words. User timelines for the remaining users were collected, and the users were filtered for spammers using all their tweets.
Annotation. Tweets containing at least one product mention were labeled with the AIDB+U purchase stages defined above, and those which do not express one of these stages were annotated with an artificial class 'N'. Two annotators were given examples of each of the AIDB+UN categories. They first individually labeled the tweets. Cohen's kappa between the annotators was 0.30. For tweets that both annotators labeled with any of AIDBU, Cohen's kappa was 0.77. In a second pass, the annotators discussed the tweets where they disagreed and agreed on a final label.
Tweet Sequences. We regard all tweets from one user as one sequence (temporally ordered). However, if the temporal distance between two successive tweets is more than two months, we split them into two sequences. This maximum distance has been chosen heuristically after a manual analysis of tweets and their time stamps.
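The splitting rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_sequences`, the `(timestamp, text)` tuple layout, and the approximation of "two months" as 60 days are hypothetical and not taken from the paper.

```python
from datetime import datetime, timedelta

# Assumption: "two months" is approximated as 60 days; the exact day count is not given.
MAX_GAP = timedelta(days=60)

def split_into_sequences(tweets, max_gap=MAX_GAP):
    """Split one user's tweets into temporally ordered sequences.

    `tweets` is a list of (timestamp, text) pairs. A new sequence starts
    whenever two successive tweets are more than `max_gap` apart.
    """
    ordered = sorted(tweets, key=lambda t: t[0])
    sequences = []
    for tweet in ordered:
        # Compare against the last tweet of the most recent sequence.
        if sequences and tweet[0] - sequences[-1][-1][0] <= max_gap:
            sequences[-1].append(tweet)
        else:
            sequences.append([tweet])
    return sequences

example = [
    (datetime(2016, 1, 5), "thinking about a new phone"),
    (datetime(2016, 1, 20), "really want the new iPad"),
    (datetime(2016, 6, 1), "just bought a watch"),
]
sequences = split_into_sequences(example)
```

In this toy input, the first two tweets fall into one sequence and the June tweet starts a second one.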

Model
We propose to use a hierarchical NN (see Figure 1) for purchase stage identification. In our experiments, we compare its components at the different hierarchy levels with alternative choices. Unlike most previous work on purchase prediction, we do not use hand-crafted features, which avoids expensive data preprocessing and manual feature design.
First, we represent each word by its embedding, skipping unknown words. The embeddings have been trained with word2vec (Mikolov et al., 2013) on Twitter data (Godin et al., 2015). Next, we compute a tweet representation that models word order. We apply convolutional filters over the word embeddings of the tweet. Finally, we feed the representations of tweets by a user into a sequence model, i.e., a unidirectional NN with gated recurrent units (GRU) (Cho et al., 2014). Thus, the model can learn patterns across tweets, such as "a user might first express interest in a product before buying it but not vice versa".
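To make the two hierarchy levels concrete, the following NumPy sketch runs a forward pass: convolution over word windows yields one vector per tweet, and a GRU (equations from Cho et al., 2014) consumes the tweet vectors in order. Everything beyond that outline is an assumption for illustration: the max-over-time pooling, the tanh nonlinearity, the omitted biases, all dimensions, the random parameters, and the names `tweet_vector`, `gru_step` and `score_sequence`.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB, N_FILTERS, WIDTH, HIDDEN, N_CLASSES = 8, 6, 3, 4, 5  # hypothetical sizes

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tweet_vector(words, filters):
    """Convolve filters over word windows, then max-pool over positions (assumed pooling)."""
    width = filters.shape[1]
    if words.shape[0] < width:  # zero-pad tweets shorter than the filter width
        pad = np.zeros((width - words.shape[0], words.shape[1]))
        words = np.vstack([words, pad])
    windows = np.stack([words[i:i + width]
                        for i in range(words.shape[0] - width + 1)])
    # windows: (n_windows, width, emb); filters: (n_filters, width, emb)
    return np.tanh(np.einsum("wke,fke->wf", windows, filters)).max(axis=0)

def gru_step(x, h, p):
    """One GRU update following Cho et al. (2014), biases omitted."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)          # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)          # reset gate
    h_new = np.tanh(p["W"] @ x + p["U"] @ (r * h))  # candidate state
    return z * h + (1.0 - z) * h_new

filters = rng.normal(scale=0.1, size=(N_FILTERS, WIDTH, EMB))
params = {k: rng.normal(scale=0.1,
                        size=(HIDDEN, N_FILTERS if k[0] == "W" else HIDDEN))
          for k in ("Wz", "Uz", "Wr", "Ur", "W", "U")}
W_out = rng.normal(scale=0.1, size=(N_CLASSES, HIDDEN))

def score_sequence(tweets):
    """Return one score vector per tweet; the hidden state carries earlier tweets."""
    h = np.zeros(HIDDEN)
    scores = []
    for words in tweets:
        h = gru_step(tweet_vector(words, filters), h, params)
        scores.append(W_out @ h)
    return np.array(scores)

# Three tweets with 5, 2 and 9 (random stand-in) word embeddings each.
tweets = [rng.normal(size=(n, EMB)) for n in (5, 2, 9)]
scores = score_sequence(tweets)
```

Because the GRU is unidirectional, the score for each tweet depends only on that tweet and the ones before it in the sequence.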

Dealing with Imbalanced Data
The dataset statistics show that the data is highly imbalanced. Users talking about products are not necessarily interested in buying them. Instead, they might write about their experience or mention that someone else has bought a product. To cope with the imbalanced labels, we propose to use a ranking layer. In our experiments, this approach outperforms traditionally used class weights.
Class Weights. If the ground truth is a non-artificial class, the error of the model is multiplied by $w > 1$. With gradient descent, the parameter updates after a false negative prediction are then larger, penalizing the model more. The weight $w_i$ for class $i$ is proportional to the inverse class frequency $f_i$: $w_i \propto 1/f_i$. The weights are normalized so that the weight for class 'N' is 1.
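As a small worked example of this weighting scheme, the sketch below computes inverse-frequency weights and rescales them so that 'N' receives weight 1. The function name `class_weights` and the toy label counts are hypothetical.

```python
from collections import Counter

def class_weights(labels, artificial="N"):
    """Inverse-frequency class weights, normalized so the artificial class has weight 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    inverse_freq = {c: total / n for c, n in counts.items()}  # 1 / f_i
    return {c: w / inverse_freq[artificial] for c, w in inverse_freq.items()}

# Toy label distribution (made up for illustration): 80% 'N', few purchase-stage tweets.
labels = ["N"] * 80 + ["I"] * 10 + ["D"] * 5 + ["B"] * 5
weights = class_weights(labels)
```

With these toy counts, a class that is eight times rarer than 'N' receives a weight of 8, so its errors count eight times as much in the loss.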
Ranking Loss. dos Santos et al. (2015) introduced the following ranking loss function:
$$L = \log\left(1 + \exp\left(\gamma \left(m^+ - s_\theta(x)_{y^+}\right)\right)\right) + \log\left(1 + \exp\left(\gamma \left(m^- + s_\theta(x)_{c^-}\right)\right)\right) \quad (1)$$
Here, $s_\theta(x)_{y^+}$ is the score for the correct label $y^+$ and $s_\theta(x)_{c^-}$ is the score for the best competitive class $c^-$. $m^+$ and $m^-$ are margins. The function aims to give scores greater than $m^+$ for the correct class and scores smaller than $-m^-$ for the incorrect classes. The factor $\gamma$ scales the penalty on errors. The function is especially suited for artificial classes (like our 'N' class) for which it might not be possible to learn a specific pattern: If $y^+ = N$, only the second summand is evaluated. At test time, 'N' is only chosen if the scores for all other classes are negative. This lets the model focus on the non-artificial classes and is the reason why we investigate this loss function in the context of data which is imbalanced between AIDB+U and 'N'.
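The ranking loss of dos Santos et al. (2015) and the test-time decision rule described above can be sketched as follows. The sketch assumes that scores are computed only for the non-artificial classes and that a gold label of `None` stands for 'N'. The value `gamma=2.0` is an assumption, since the text does not state it, and the names `ranking_loss` and `predict` are hypothetical.

```python
import numpy as np

LABELS = ["A", "I", "D", "B", "U"]  # non-artificial classes; 'N' has no score of its own

def ranking_loss(scores, gold, m_pos=2.5, m_neg=0.5, gamma=2.0):
    """Pairwise ranking loss; `gold` is an index into `scores`, or None for class 'N'.

    gamma=2.0 is an assumed value, not taken from the text.
    """
    scores = np.asarray(scores, dtype=float)
    if gold is None:
        # y+ = N: only the second summand is evaluated.
        return np.log1p(np.exp(gamma * (m_neg + scores.max())))
    best_wrong = np.delete(scores, gold).max()  # best competitive class c-
    pos = np.log1p(np.exp(gamma * (m_pos - scores[gold])))
    neg = np.log1p(np.exp(gamma * (m_neg + best_wrong)))
    return pos + neg

def predict(scores, labels=LABELS):
    """Choose 'N' only if every non-artificial class has a negative score."""
    scores = np.asarray(scores, dtype=float)
    if np.all(scores < 0):
        return "N"
    return labels[int(scores.argmax())]

good = ranking_loss([-3.0, 4.0, -3.0, -3.0, -3.0], gold=1)  # correct class scored high
bad = ranking_loss([4.0, -3.0, -3.0, -3.0, -3.0], gold=1)   # competitor scored high
```

A confident correct prediction (`good`) yields a much smaller loss than a confident wrong one (`bad`), and a tweet whose scores are all negative is labeled 'N' without the model ever having to learn a pattern for that class.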

Experiments and Results
Due to the high class imbalance in our dataset, we use the macro F1 of the non-artificial classes as our evaluation measure. We implement the NNs with Theano (Theano Development Team, 2016) and the non-neural classifiers with scikit-learn (Pedregosa et al., 2011). For training the NNs, we use stochastic gradient descent and shuffle the training data at the beginning of each epoch. We apply AdaDelta as the learning rate schedule (Zeiler, 2012). The hyper-parameters (number of hidden units, number of convolutional filters, and convolutional filter widths) are optimized on dev. We apply L2 regularization with λ = 0.00001 and early stopping on the dev set. To avoid exploding gradients, we clip the gradients at a threshold of t = 1.
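The evaluation measure can be illustrated with a short, dependency-free sketch: per-class F1 is computed for the non-artificial classes only and then averaged, so 'N' affects the score only through the errors it causes on the other classes. The function name `macro_f1` and the toy labels are hypothetical.

```python
def macro_f1(gold, pred, classes=("A", "I", "D", "B", "U")):
    """Unweighted mean of per-class F1 over the given (non-artificial) classes."""
    f1_scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Toy example restricted to two classes for readability.
gold = ["N", "N", "I", "D"]
pred = ["N", "I", "I", "N"]
score = macro_f1(gold, pred, classes=("I", "D"))
```

In the toy example, 'I' reaches an F1 of 2/3 (one false positive), 'D' is never predicted and gets 0, so the macro average is 1/3 even though two of the four tweets are labeled correctly.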

Data Preprocessing
To preprocess the tweets, we apply publicly available scripts which use twokenize (Owoputi et al., 2013) for tokenization and perform some basic cleaning steps, such as replacing URLs with a special token or normalizing elongated words. Then, we split the data by user into training, development (dev) and test sets (80%, 10%, and 10%, respectively). To reduce the class imbalance, we randomly subsample 'N' tweets in the training set. Table 2 provides statistics for the final dataset.
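The user-level split and the subsampling of 'N' tweets might look like the following sketch. Splitting by user keeps all tweets of one user in the same partition. The function names, the fixed seeds, and the `keep_prob` parameter are hypothetical; the subsampling rate actually used is not stated in the text.

```python
import random

def split_users(user_ids, seed=0):
    """Partition users into train/dev/test sets with an 80/10/10 ratio."""
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)
    n_train = len(users) * 8 // 10
    n_dev = len(users) // 10
    train = set(users[:n_train])
    dev = set(users[n_train:n_train + n_dev])
    test = set(users[n_train + n_dev:])
    return train, dev, test

def subsample_n(examples, keep_prob, seed=0):
    """Keep every non-'N' example; keep each 'N' example with probability `keep_prob`.

    `examples` is a list of (text, label) pairs. `keep_prob` is a hypothetical knob.
    """
    rng = random.Random(seed)
    return [ex for ex in examples if ex[1] != "N" or rng.random() < keep_prob]

train_users, dev_users, test_users = split_users(range(100))
toy_train = [("t%d" % i, "N") for i in range(50)] + [("want it", "D"), ("bought it", "B")]
reduced = subsample_n(toy_train, keep_prob=0.2)
```

Subsampling is applied to the training partition only, so dev and test keep the original label distribution.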

Experiments
Baseline Models. In addition to a random guessing baseline, we use two non-neural baseline models. Table 3 shows that RCRNN clearly outperforms the non-neural models.
For the ranking loss, we set $m^+$ to 2.5 and $m^-$ to 0.5 as in dos Santos et al. (2015).
Impact of RCRNN Components. We first investigate CNN against two other methods for calculating tweet representations (Table 4): (1) averaging word embeddings (Average) (Korpusik et al., 2016; Le and Mikolov, 2014) and (2) a bidirectional GRU with attention (GRU+att). For the GRU, we use the equations provided in Cho et al. (2014). For each intermediate hidden state $x_i$ of the GRU, we calculate the attention weight $\alpha_i$ with a softmax layer:
$$\alpha_i = \frac{\exp(V^\top x_i)}{\sum_j \exp(V^\top x_j)}$$
where $V$ is a parameter of the model that is initialized randomly and learned during training. We then use the weighted sum of all hidden states as the tweet representation. GRU+att and CNN clearly outperform Average, which can neither take word order into account nor focus on relevant words. Also, CNN outperforms GRU+att.
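The attention pooling used in the GRU+att variant can be sketched as a softmax over one scalar score per hidden state, followed by a weighted sum. The dot-product scoring with a single vector `v` is an assumption about the exact form of the softmax layer, and the name `attention_pool` and the toy values are hypothetical.

```python
import numpy as np

def attention_pool(hidden, v):
    """Softmax attention over hidden states.

    hidden: (n_steps, dim) array of GRU hidden states.
    v: (dim,) learned scoring vector (assumed dot-product scoring).
    Returns the attention weights and the weighted sum of hidden states.
    """
    scores = hidden @ v
    scores = scores - scores.max()  # subtract the max for numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()
    return alpha, alpha @ hidden

hidden = np.arange(6.0).reshape(3, 2)  # three toy hidden states of dimension 2
alpha, pooled = attention_pool(hidden, np.zeros(2))
```

With a zero scoring vector all positions receive equal weight, so the pooled vector reduces to a plain average; a trained `v` would instead shift weight towards the hidden states of relevant words.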
Next, we show the positive impact of GRU as a tweet sequence model by replacing it with models that do not use sequential information. In particular, we use a simple feed-forward (FF) model (with and without a hidden layer) to predict the output label given only the current tweet representation calculated by a CNN. The results provided in Table 5 show that GRU outperforms the FF models. Thus, there is cross-tweet information which can be exploited for purchase stage prediction.
Finally, we investigate ways of dealing with imbalanced data: We replace the ranking layer of RCRNN with a cross-entropy (CE) loss with and without class weights (see Section 4.1). Table 6 shows that class weights improve CE but ranking performs best. Adding class weights to the baseline SVM improves the model to 46.27 on dev and 50.89 on test. The performance on both dev and test is still worse than that of RCRNN. Thus, our experiments do not confirm previous studies which found that SVMs were superior to NNs on imbalanced data (Chawla et al., 2004).
Table 6: Impact of ranking layer on RCRNN
To sum up, we observed that convolution provided the best tweet representation while a GRU was helpful to model tweet sequences. Ranking could best deal with class imbalance. Figure 2 shows the confusion matrix for RCRNN. Apart from confusions with 'N' which most probably result from the class imbalance, the model confuses neighboring labels, such as 'I' and 'D'. In total, over 90% of the confusions involve 'N'. This shows that the model is reasonably good at distinguishing the purchase stages and that the main difficulty is class imbalance. In future work, we will extend the investigation of this topic.

Conclusion
We defined a purchase stage identification task based on the AIDA model. We compared several neural and non-neural models of tweets and tweet sequences and observed the best performance using RCRNN, our ranking-based hierarchical network which uses convolution to represent tweets and gated recurrent units to model tweet sequences. Our results indicate that tweets indeed contain signals indicative of purchase stages which can be captured by deep learning models. Ranking was the most effective way to deal with class imbalance.
Figure 2: Confusion matrix on test set