Churn Identification in Microblogs using Convolutional Neural Networks with Structured Logical Knowledge

For brands, gaining a new customer is more expensive than keeping an existing one; therefore, retaining customers is becoming increasingly challenging. Churn happens when a customer leaves a brand for a competitor. Most previous work considers the problem of churn prediction using Call Detail Records (CDRs). In this paper, we use micro-posts to classify customers as churny or non-churny. We explore the power of convolutional neural networks (CNNs), since they have achieved state-of-the-art results in various computer vision and NLP applications. However, end-to-end models have limitations, such as requiring a large amount of labeled data and being uninterpretable. We investigate the use of CNNs augmented with structured logic rules to reduce these issues. We developed a system called Churn teacher, which uses an iterative distillation method to transfer the knowledge extracted from the combination of just three logic rules directly into the weights of the deep neural network. Furthermore, we used weight normalization to speed up the training of our convolutional neural network. Experimental results show that with just these three rules, we achieve state-of-the-art results on a publicly available Twitter dataset about three Telecom brands.

Introduction
Customer churn may be defined as the process of losing a customer who recently switched from a brand to a competitor. The churn problem can be tackled from different angles: most previous work used Call Detail Records (CDRs) to separate churners from non-churners (Zaratiegui et al., 2015). More recently, as more data has become available on the web, brands can use customers' opinionated comments on social networks, forums, and especially Twitter to distinguish churny from non-churny customers. We used the churn dataset developed by (Amiri and Daumé III, 2015), which was collected from Twitter for three telecommunication brands: Verizon, T-Mobile, and AT&T.
In recent years, deep learning models have achieved great success in various domains and on difficult problems such as computer vision (Krizhevsky et al., 2012) and speech recognition. In natural language processing, much of the work with deep learning models has involved language modeling (Bengio et al., 2003; Mikolov et al., 2013), sentiment analysis (Socher et al., 2013), and, more recently, neural machine translation (Cho et al., 2014; Sutskever et al., 2014). These models can be trained with the backpropagation algorithm (Rumelhart et al., 1988).
Despite the success of deep neural networks, these models still differ from the human learning process. While their success comes from their high expressiveness, this leads them to produce predictions in uninterpretable ways, which can have negative side effects on the whole learning process (Szegedy et al., 2013; Nguyen et al., 2015). In addition, these models require a huge amount of labeled data, which is time consuming to produce since it requires human experts in most applications (natural language and computer vision). Recent work has tackled this issue by trying to bridge the gap in different settings: in supervised learning, such as machine translation (Wu et al., 2016), and in unsupervised learning (Bengio et al., 2015).
In the Natural Language Processing (NLP) community, there has been much work on augmenting the training process with additional useful features (Collobert et al., 2011), which has proved successful in various NLP applications such as Named Entity Recognition (NER) and Sentiment Analysis. The majority of these works used pretrained word embeddings obtained from unlabeled data to initialize their word vectors, which allowed them to improve performance. Another solution comes from integrating logic rules extracted from the data directly into the weights of neural networks: (Hu et al., 2016) explored a distillation framework that transfers structured knowledge encoded as logic rules into the weights of neural networks, and (Garcez et al., 2012) derived a neural network from a given rule to perform reasoning.
In this paper, we combine the two ideas. First, we use unsupervised word representations to initialize our word vectors: we explore three different pretrained word embeddings and compare the results with randomly sampled vectors. Second, we use three logic rules that have proven to be useful. The "but" rule was explored by (Hu et al., 2016) in sentiment analysis, and we add two new rules: "switch to" and "switch from". (Amiri and Daumé III, 2016) showed that these last two rules have a remarkable influence on the churn classification problem.
Moreover, in order to accelerate training our model on the churn training dataset, we investigate the use of weight normalization (Salimans and Kingma, 2016), a recently developed method to accelerate the training of deep neural networks.
Experiments on a Twitter dataset built from a large number of tweets about three telecommunication brands show that we obtain state-of-the-art results for churn classification in microblogs. Our system, called Churn teacher, is constructed by transferring structured logical knowledge, expressed as three logic rules, into the weights of convolutional neural networks. We outperform previous models based on hand-engineered features as well as those using recurrent neural networks combined with minimal features. Our system is philosophically close to (Hu et al., 2016), who showed that combining deep neural networks with logic rules performs well on two NLP tasks: NER and sentiment analysis.
The rest of this paper is structured as follows: in section 2, we discuss related work on churn prediction. In section 3, we present our churn prediction approach, which is based on structured logical knowledge transferred into the weights of Convolutional Neural Networks (CNNs). In section 4, we discuss the impact of pretrained word embeddings on churn classification. The experimental results are presented in section 5. Finally, we present the conclusion and future work in section 6.

Related Work
Churn prediction is an important area of focus for sentiment analysis and opinion mining. In 2009, the ACM Conference on Knowledge Discovery and Data Mining (KDD) hosted a competition on predicting mobile network churn using a large dataset provided by Orange Labs, which makes churn prediction a promising application for the coming years. We can divide the previous work on customer churn prediction into two research groups: the first group uses data from companies such as Telecom providers, banks, or other organizations; more recently, with the explosion of social networks, researchers have become interested in using social networks such as Twitter to predict churners.
Using data from banks, (Keramati et al., 2016) developed a system for customer churn in electronic banking services. They used a decision tree algorithm to build their classification model. The main goal of that work was to study the most relevant features of churners in banking services, such as demographic variables (age, gender, career, etc.), transaction data from electronic banking portals (ATM, mobile bank, telephone bank, etc.), and others. They used a method called CRISP-DM, which contains six phases: business understanding, data understanding, data preprocessing, modeling, evaluation, and deployment. At the final stage, they used a decision tree to model the previous phases.
(Backiel et al., 2016) studied the impact of incorporating social network information into churn prediction models. The authors used three different machine learning (ML) techniques: logistic regression, neural networks, and Cox proportional hazards. To extract features for these ML techniques, they built a call graph, which allowed them to extract the relevant features. (Li et al., 2016) developed a model based on a stacked auto-encoder as a feature extractor to detect the most influential features in Telecom churn prediction. In addition, they proposed a second model, called Hybrid Stacked Auto-Encoder (HSAE), where they augmented the previous model with Fisher's ratio analysis. The models were evaluated on Orange datasets. Experimental results showed that the HSAE model outperformed all the other models, including Principal Component Analysis (PCA).
More recently, researchers have tackled the churn prediction problem using data collected from microblogs. (Amiri and Daumé III, 2015) developed a system for churn prediction in microblogs. They investigated machine learning models such as support vector machines and logistic regression combined with extracted features. Furthermore, they investigated the use of three different churn indicators: demographic, content, and context indicators. Experimental results showed that the combination of the three indicators led to the best performance. (Amiri and Daumé III, 2016) used the power of Recurrent Neural Networks (RNNs) as representation learning models in order to learn micro-post and churn indicator representations. Experiments on a publicly available Twitter dataset showed the efficiency of the proposed method in classifying customers into churners and non-churners. Moreover, the authors showed that churn classification differs from the classical sentiment analysis problem, since previous state-of-the-art sentiment analysis systems failed to classify churny/non-churny customers.
In this work, we focus on churn prediction in microblogs and use the publicly available Twitter dataset provided by (Amiri and Daumé III, 2015) to evaluate our system.

The Proposed System
In this section, we introduce our system, which enables a convolutional neural network to learn simultaneously from logic rules and labeled data in order to classify customers as churners and non-churners. The general architecture of our system can be seen as the combination of knowledge distillation (Hinton et al., 2015) and the posterior regularization method (Ganchev et al., 2010). (Hu et al., 2016) explored this combination to build two systems for English sentiment analysis and named entity recognition. We show that this framework can also be applied to customer churn prediction by deriving more logic rules and transferring the structured logical knowledge into the weights of a convolutional neural network.

Problem Formulation
For the purpose of this paper, let x ∈ X be the input variable and y ∈ Y the target variable. Consider the training data D = {(x_n, y_n)}_{n=1}^N, a set of instantiations of (x, y), where N is the number of training examples in our dataset. For clarity, we focus on the classification problem where the target y is a one-hot encoding of the class labels. We consider a subset (X, Y) of the training data as a set of instances of (x, y). A neural network defines a conditional probability p_θ(y|x) parameterized by θ.
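A minimal sketch of this setup, with one-hot labels and p_θ(y|x) given by a softmax over two output logits (the logit values below are made up for illustration):

```python
import math

def softmax(logits):
    # numerically stable softmax: subtract the max before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Toy binary churn setup: y is a one-hot vector over {churny, non-churny},
# and p_theta(y|x) is the softmax over the network's two output logits.
y_churny = [1, 0]                 # one-hot encoding of the "churny" class
p_theta = softmax([2.0, -1.0])    # hypothetical logits for one tweet
```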

Neural Network with Structured Logical Knowledge
In this section, we describe the distillation method that allows our system to transfer structured logical knowledge into the weights of convolutional neural networks to classify customers as churners and non-churners. Let us define a set of constraint functions f_l : X × Y → R, where l indexes a specific constraint function. In our problem, these functions are logic rules whose truth values lie in the interval [0, 1]. They allow us to encode structured logical knowledge, with the goal of satisfying them (i.e., maximizing their truth values by optimizing the predictions y) with confidence weights λ_l ∈ R.
The structure-enriched teacher network q is constructed at each iteration from the neural network parameters by solving the following optimization problem:

min_{q ∈ P} KL(q(Y) || p_θ(Y|X)) + C Σ_l λ_l E_q[1 − f_l(X, Y)]    (1)

where P denotes the appropriate distribution space and C is the regularization parameter. In this paper, the teacher is called Churn teacher, and its main goal is to teach the model to classify customers as churners or non-churners. The closeness between our Churn teacher and p_θ, the conditional probability obtained by the softmax output layer of the convolutional neural network, is measured using the Kullback-Leibler (KL) divergence, which we aim to minimize. We note that problem (1) is convex and has the closed-form solution:

q*(Y) ∝ p_θ(Y|X) exp{−C Σ_l λ_l (1 − f_l(X, Y))}    (2)

The normalization term can be computed efficiently through direct enumeration of the chosen rule constraints. At each iteration, the probability distribution of the neural network p_θ is updated using the distillation objective (Hinton et al., 2015), which balances imitating the soft predictions of our Churn teacher q against predicting the true hard labels:

θ^{(t+1)} = argmin_θ (1/N) Σ_{n=1}^N (1 − π) ℓ(y_n, σ_θ(x_n)) + π ℓ(s_n^{(t)}, σ_θ(x_n))    (3)

where ℓ denotes the cross-entropy loss function used in this paper; N is the training set size; σ_θ(x) is the softmax output of p_θ on x; s_n^{(t)} is the soft prediction vector of the Churn teacher q on training point x_n at iteration t; and π is the imitation parameter calibrating the relative importance of the two objectives. In addition to the teacher q, (Hu et al., 2016) also evaluated a student network p; in their experiments, the teacher q always gave better results than the student p, so we only use the teacher network q, called in our work the Churn teacher.
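As a rough sketch of Equations 2 and 3 for a single example, the teacher construction and distillation loss could look like the following; the probabilities, rule penalties, and hyperparameter values are illustrative, not taken from the paper:

```python
import math

def build_teacher(p_theta, penalties, C=100.0):
    """Closed-form teacher of Equation 2 for one example:
    q(y) ∝ p_theta(y) * exp(-C * penalty(y)), with penalty(y) = Σ_l λ_l (1 - f_l(x, y)).
    Normalization is done by direct enumeration over the label set."""
    unnorm = [p * math.exp(-C * pen) for p, pen in zip(p_theta, penalties)]
    s = sum(unnorm)
    return [u / s for u in unnorm]

def cross_entropy(target, pred):
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, pred))

def distillation_loss(pred, hard_label, soft_pred, pi):
    """Distillation objective of Equation 3 for one example: a convex combination
    of the loss on the true hard label and the loss on the teacher's soft prediction."""
    return (1 - pi) * cross_entropy(hard_label, pred) + pi * cross_entropy(soft_pred, pred)

# Illustrative numbers: a rule fully satisfied by label 0 and violated by label 1.
p = [0.4, 0.6]                      # CNN softmax output p_theta(y|x)
q = build_teacher(p, [0.0, 1.0], C=2.0)
loss = distillation_loss(p, [1.0, 0.0], q, pi=0.5)
```

Note how the teacher shifts probability mass toward the rule-consistent label even when the network itself prefers the other one.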

Neural Network Architecture
In this section, we give an overview of the convolutional neural network used in our work. The main architecture of our CNN is depicted in Figure 1. We followed the convolutional neural network architecture proposed by (Kim, 2014).
In the first step, we initialize each word in a sentence T of length n with pretrained word representations learned from unannotated corpora, adding padding whenever necessary. T is represented as:

T = v_1 ⊕ v_2 ⊕ ... ⊕ v_n

where v_i is the word vector of the i-th word in the sentence T and ⊕ is the concatenation operator. We use successive filters w to obtain multiple feature maps. Each filter is applied to a window of m words to produce a single feature:

c_i = φ(w · v_{i:i+m−1} + b)

where b is a bias term and φ denotes an element-wise nonlinearity, for which we used the ReLU (Rectified Linear Unit). In the next step, we apply a max-over-time pooling operation (Collobert et al., 2011) to each feature map and take the maximum value. The results are fed to a fully connected softmax layer to obtain class probabilities. Figure 1 illustrates the architecture of our system classifying the input sentence: "Damn thats crud. You should switch to Verizon".
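A minimal forward pass of this architecture might look like the following; the dimensions and random weights are toy values, not the paper's actual filter sizes or hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cnn_forward(sentence_vecs, filters, biases, W_out, b_out):
    """Kim-style CNN: one feature map per filter, max-over-time pooling,
    then a fully connected softmax layer over the two churn classes."""
    pooled = []
    for w, b in zip(filters, biases):
        m = w.shape[0]  # window size in words
        # feature map: c_i = relu(w · v_{i:i+m-1} + b) for every window of m words
        c = [relu(np.sum(w * sentence_vecs[i:i + m]) + b)
             for i in range(sentence_vecs.shape[0] - m + 1)]
        pooled.append(max(c))  # max-over-time pooling keeps one value per filter
    return softmax(np.array(pooled) @ W_out + b_out)

n, d, m, num_filters = 7, 5, 3, 4  # toy sentence length, embedding dim, window, filters
sent = rng.normal(size=(n, d))     # stand-in for pretrained word vectors v_1 ... v_n
filters = [rng.normal(size=(m, d)) for _ in range(num_filters)]
biases = rng.normal(size=num_filters)
probs = cnn_forward(sent, filters, biases, rng.normal(size=(num_filters, 2)), np.zeros(2))
```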

Logic Rules
In the early stages of artificial intelligence, (Minsky, 1983) argued that human beings learn from two different sources: from examples, as deep neural networks do these days, and also from rich experiences and general knowledge. For this reason, we use both sources for churn prediction in microblogs: the convolutional neural network learns from examples, while the logic rules add structured knowledge into the weights of the CNN, playing the role of a regularizer in the learning process.
In this section, we present the three logic rules used in our churn prediction system.

Figure 1: The system architecture. The word vectors are initialized with pretrained word representations from one of three models: GloVe, Skip-Gram, or CBOW. We feed these word vectors to the convolutional neural network, followed by max-over-time pooling and a fully connected layer with softmax to obtain probabilities (FC with Softmax).

We borrow the first logic rule from the sentiment analysis literature: the conjunction "but". It has been shown that "but" plays a vital role in determining the sentiment of a sentence, where the overall sentiment is dominated by the clause following the word "but" (Jia et al., 2009; Dadvar et al., 2011; Hogenboom et al., 2011; Hu et al., 2016). For a given sentence of the form "C1 but C2", we assume that the sentiment of the whole sentence takes the polarity of clause C2.
The second logic rule we use is "switch from", considered a target-dependent churn classification rule. (Amiri and Daumé III, 2016) showed that "switch from" can play an important role in classifying whether a customer will be a churner or a non-churner. The last logic rule we explore is similar to the second in being target dependent: we substitute the preposition "from" with "to" to obtain "switch to". For a given sentence of the form "C1 switch to C2", it is clear that the customer will switch to the brand mentioned in clause C2.
Consider the following two examples from the training data:
• Damn thats crud. You should switch to Verizon.
• Gonna switch from bad @verizon internet to @comcast. @VerizonFiOS will never be in my area and i bet @googlefiber will get here first.
In the first example, the customer will switch to the brand "Verizon", while in the second example, the customer will leave the brand "Verizon" for another competitor. Consequently, with respect to the brand "Verizon", the first tweet is classified as "Non-churny" and the second tweet as "Churny".
For the two other logic rules, "switch to" and "switch from", we follow the same structure as the "but" rule with slightly different settings, applied to sentences S with the structure "C1 switch to/from C2". For the rule "switch to", if we classify the sentence S with respect to a brand in clause C2, the tweet is non-churny with respect to that brand, which gives the truth value (1 + σ_θ(C2)_+)/2. The rule "switch from" plays the opposite role: if a brand appears in clause C2, the overall sentiment with respect to this brand is negative, so we use the truth value (2 − σ_θ(C2)_+)/2.
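A sketch of the two target-dependent truth values described above; the simple substring matching and the hand-supplied stand-in for σ_θ(C2)_+ are assumptions for illustration, not the paper's actual clause detection:

```python
def rule_truth_values(tweet, sigma_c2_pos):
    """Soft truth values for the two target-dependent rules.

    tweet:         raw text of the micro-post
    sigma_c2_pos:  the model's positive-class probability for clause C2,
                   a stand-in for σ_θ(C2)_+ (supplied by hand here)
    Returns {rule name: truth value in [0, 1]} for the rules that fire.
    """
    values = {}
    text = tweet.lower()
    if "switch to" in text:
        # the brand in C2 gains the customer: non-churny w.r.t. that brand
        values["switch to"] = (1.0 + sigma_c2_pos) / 2.0
    if "switch from" in text:
        # the brand named after "switch from" is being left: churny w.r.t. it
        values["switch from"] = (2.0 - sigma_c2_pos) / 2.0
    return values

v = rule_truth_values("Damn thats crud. You should switch to Verizon", 0.9)
```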

Training Details
Training is done using stochastic gradient descent over mini-batches with the Adadelta update rule (Zeiler, 2012). Word vectors are initialized with pretrained word embeddings and fine-tuned throughout training. At each iteration of the training process, we enumerate the rule constraints (or the full set of rules when using them all at once) in order to compute the soft predictions of our Churn teacher q using Equation 2.
During the experiments, we set the imitation parameter to π(t) = 1 − 0.85^t and the regularization parameter to C = 100. We set the confidence levels of the rules to λ_l = 1. We used the model's results on the development set to select the best hyperparameters. The training procedure is summarized in Algorithm 1.

Algorithm 1: Training the Churn teacher
Input: training data D = {(x_n, y_n)}_{n=1}^N, rule set R = {(R_l, λ_l)}_{l=1}^3
Parameters: π (imitation parameter), C (regularization strength)
Initialize the neural network parameters
Choose the rule R_l or a set of rules
do
    Sample a minibatch (X, Y) ⊂ D
    Build the Churn teacher network q using Equation 2
    Update p_θ using Equation 3
while not converged
Output: Churn teacher q
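Reading the imitation schedule as the exponential decay π(t) = 1 − 0.85^t (an assumption; the original text's "1 − 0.85t" would go negative after the first epoch), it can be sketched as:

```python
def imitation_parameter(t, k=0.85):
    """π(t) = 1 - k**t: early iterations rely mostly on the true hard labels,
    later iterations increasingly imitate the Churn teacher's soft predictions."""
    return 1.0 - k ** t

pis = [imitation_parameter(t) for t in range(5)]
```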

The Effect of Weight Normalization
We use weight normalization, a method introduced by (Salimans and Kingma, 2016) that reparameterizes the weight vectors in a deep neural network in order to decouple their length from their direction. Using this method, the authors showed that they were able to improve the conditioning of the optimization problem and speed up the convergence of stochastic gradient descent. This method follows earlier work by (Ioffe and Szegedy, 2015), who introduced batch normalization, which normalizes each neuron's output by the mean and standard deviation of the outputs computed over a minibatch of examples. More recently, such normalization methods have been widely used in deep learning architectures including deep convolutional neural networks, deep reinforcement learning, and generative adversarial networks (GANs) (Smith and Topin, 2016; Gehring et al., 2017). Using weight normalization when training our convolutional neural network allowed us to accelerate the convergence of stochastic gradient descent.
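The reparameterization itself is a one-liner: each weight vector w is expressed as w = g · v/‖v‖, so the scalar g carries the length and v only the direction. A minimal sketch:

```python
import math

def weight_norm(v, g):
    """Weight normalization: w = g * v / ||v||. The length of w is exactly g
    (learned as a separate scalar), while v only controls the direction."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]

w = weight_norm([3.0, 4.0], g=2.0)
length = math.sqrt(sum(x * x for x in w))   # equals g regardless of v's scale
```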

Input Word Embeddings
Research on representing words as continuous vectors has a long history (Hinton et al., 1985). More recently, (Bengio et al., 2003) proposed a model architecture based on feedforward neural networks for estimating a neural network language model. The most popular word representation models were developed by (Mikolov et al., 2013) under the name word2vec, using either of two architectures to produce distributed representations of words: the continuous bag-of-words (CBOW) model or the Skip-Gram (SG) model. Another popular model, called "GloVe" (Global Vectors), was developed by (Pennington et al., 2014). The main difference between GloVe and the word2vec models lies in how word representations are learned: the word2vec models use a local window approach, while GloVe captures the global statistics of word-word co-occurrence in the corpus. (Hisamoto et al., 2013) used word embedding features for English dependency parsing, employing flat (non-hierarchical) cluster IDs and binary strings obtained via sign quantization of the vectors. For chunking, (Turian et al., 2010) showed that adding word embeddings increases an English chunker's F1-score. (Huang et al., 2014) showed that adding word embeddings as features helps English part-of-speech (POS) tagging, (Bansal et al., 2014) argued that word embeddings improve English parsing, and for English sentiment analysis, (Kim, 2014) showed that pretrained word embeddings improve accuracy.
As in (Collobert et al., 2011), in order to test the importance of pretrained word embeddings for churn prediction in microblogs, we performed experiments with different sets of publicly published word embeddings, as well as a random sampling method, to initialize the word vectors in our model. We investigated three different pretrained word embeddings: Skip-Gram, continuous bag-of-words, and Stanford's GloVe model. Table 1 presents the results of these experiments. According to the results in Table 1, using pretrained word embeddings yields a significant improvement, about 3.54% in F1 score, over random embeddings. This is consistent with results reported by previous work on other NLP tasks (Collobert et al., 2011; Chiu and Nichols, 2015). Among the pretrained embeddings, Stanford's GloVe 300-dimensional embeddings achieve the best results, about 1.12% better than the Skip-Gram model and 0.78% better than the continuous bag-of-words model. One possible reason that word2vec is not as good as Stanford's GloVe model is vocabulary mismatch: the word2vec embeddings were trained in a case-sensitive manner, excluding many common symbols such as punctuation and digits. Because we do not use any data pre-processing to deal with such common symbols or rare words, this might be an issue when using word2vec.

Table 3: Results with the three logic rules compared to the baselines without and with pretrained word embeddings. We test each rule independently and then combine them in one experiment.
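The initialization strategy just described can be sketched as follows; the toy `pretrained` dictionary and the `scale` of the random fallback are illustrative assumptions, not the paper's exact choices:

```python
import random

random.seed(42)

def build_embedding_matrix(vocab, pretrained, dim=300, scale=0.25):
    """Build the word-vector lookup table: copy a pretrained vector (e.g., from
    GloVe) when available, else draw a uniform random vector in [-scale, scale]
    for out-of-vocabulary words."""
    matrix = []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            matrix.append([random.uniform(-scale, scale) for _ in range(dim)])
    return matrix

pretrained = {"verizon": [0.1] * 300}       # toy stand-in for a GloVe vector file
E = build_embedding_matrix(["verizon", "churny"], pretrained)
```

This also makes the case-sensitivity issue concrete: a lookup for "Verizon" in a lowercased `pretrained` dictionary would miss and fall back to random.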

Experimental Results
For the evaluation of our model, we use the dataset provided by (Amiri and Daumé III, 2015). The authors collected the data from Twitter for three telecom brands: Verizon, T-Mobile, and AT&T. Table 2 presents the details of this dataset. We divide the experimental process into two stages: the first stage runs the convolutional neural network without and with different logic rules in order to select the best settings; the second stage compares our best settings with the previous state-of-the-art systems for churn prediction in microblogs. Table 3 shows the churn classification results. The first row represents the baseline, where we use only the convolutional neural network. In the second row, we initialize our word vectors with pretrained GloVe vectors, since they gave the best results among the pretrained embeddings; we obtain an improvement of around 2.5% in F1-score and refer to this model as "CNN-pre". This result is consistent with the view that pretrained word vectors are universal feature extractors, which have shown important results in NLP applications such as sentiment analysis, named entity recognition, and part-of-speech tagging. By transferring the knowledge extracted with the "but" logic rule into the weights of the convolutional neural network, we improve over the CNN-pre model by 1.28 points in F1-score. With the "switch from" logic rule, we get a slight improvement over the CNN-pre model of 0.25 points in F1-score. The biggest improvement among the three logic rules comes from the "switch to" rule, which improves over the CNN-pre model by 1.93 points in F1-score. The last row of Table 3 shows the results obtained using all three logic rules together, which achieve the best performance and outperform the CNN-pre model by 3.18 points in F1-score.
While we do not have a complete explanation of why the "switch to" rule gives the best results, we believe it is because 321 sentences in the training data contain this rule, i.e., around 8% of sentences contain the phrase "switch to"; moreover, in such sentences it is clear that the customer will leave a specific brand for a new one. For the "switch from" rule, we get only a slight improvement over the CNN-pre model because few sentences contain this rule (around 2% of sentences contain the phrase "switch from"). Table 4 shows statistics on the presence of the three rules in the training data. For the "but" rule, we also get an important improvement over the CNN-pre model, which confirms the results obtained by (Hu et al., 2016) using the same rule for sentiment analysis. We note that around 9% of sentences contain the word "but". In the last row, we combine all three rules and obtain the best performance; we refer to this model as "Churn teacher". This is consistent with the argument of (Hu et al., 2016) that more rules allow the system to improve its performance over the base convolutional neural network. We test our model on this dataset and compare the obtained results with two other systems. The previous state-of-the-art results were produced by (Amiri and Daumé III, 2016), who achieved 78.30% in F1 score using a combination of bag-of-words features and recurrent neural networks. The second system, referenced here as "Unigram + Nb" and developed by (Amiri and Daumé III, 2015), used different n-grams (n = 1, 2, 3) and their combinations at both the word and character levels.

Table 5: Comparison of our system with two previous systems.

Model                                         F1 score
Unigram + Nb (Amiri and Daumé III, 2015)      73.42
(Amiri and Daumé III, 2016)                   78.30
Churn teacher                                 83.85
By adding three rules to the convolutional neural network, we outperform the "Unigram + Nb" system by a large margin (10.43 points in F1-score). Furthermore, our model also outperforms the system developed by (Amiri and Daumé III, 2016) by a good margin (5.55 points in F1-score). Table 5 summarizes the experimental results and the comparison with the two other systems.

Conclusion
In this paper, we explored the problem of target-dependent churn classification in microblogs. We combined the power of convolutional neural networks with structured logical knowledge by constructing a Churn teacher capable of classifying customers into churners and non-churners. In addition, we confirmed that initializing word vectors with pretrained word embeddings trained on unannotated corpora improves system performance.
A key aspect of our system is that it transfers the structured knowledge of logic rules into the weights of convolutional neural networks for the churn classification problem. By combining three logic rules, our model largely outperformed all previous models on a publicly available Twitter dataset. We showed that the "but" rule is also useful for churn prediction, confirming the results obtained for sentiment classification. We consider the two other rules ("switch to" and "switch from") as target-specific rules for churn classification, which helped the system improve its performance.
In future work, we will explore the use of character-level embeddings, representing each word in a sentence by the concatenation of two embeddings: its word embedding obtained from the lookup table and an embedding obtained from its characters. Furthermore, we will explore the use of named entity recognition to recognize different organizations, focusing on brands, which we believe could lead to better churn classification.