Camouflaged Chinese Spam Content Detection with Semi-supervised Generative Active Learning

We propose a Semi-supervIsed GeNerative Active Learning (SIGNAL) model to address the imbalance, efficiency, and text camouflage problems of the Chinese text spam detection task. A “self-diversity” criterion is proposed for measuring the “worthiness” of a candidate for annotation. A semi-supervised variational autoencoder with a masked attention learning approach and a character variation graph-enhanced augmentation procedure are proposed for data augmentation. The preliminary experiment demonstrates that the proposed SIGNAL model is not only sensitive to spam sample selection, but can also improve the performance of a series of conventional active learning models for the Chinese spam detection task. To the best of our knowledge, this is the first work to integrate active learning and semi-supervised generative learning for text spam detection.


Introduction
The recent successes of learning-based models all share the same prerequisite: a decent labeled training dataset is available for the given task (Jiang et al., 2019b; Arora and Agarwal, 2007). However, the annotation process can be "a tedious, laborious, and time consuming task for humans" (Sharma et al., 2015). To achieve high task performance with low labeling cost, (pool-based) active learning (Cohn et al., 1996) algorithms have been proposed to select the most representative and informative samples to be labeled by human oracles (Druck et al., 2009). Although effective in general, active learning remains challenging in the context of Chinese text spam detection for the following reasons:
Imbalance: in reality, the ratio of spam samples to normal ones is highly imbalanced. For instance, in North America, "much less than 1% of SMS messages were spam" (Almeida et al., 2013). As a result, the active learning model should be more sensitive to spam samples. General active learning methods, e.g., (Lewis and Gale, 1994; Li and Guo, 2013; Roth and Small, 2006), can hardly address this problem.
Efficiency: when competing with anti-spam models, spammers are constantly creating new forms of spam text (Xie et al., 2012; Jiang et al., 2019a). The number of unlabeled samples is huge and keeps increasing. The classical diversity-based approach (Brinker, 2003; Xu et al., 2003), which iteratively compares each unlabeled sample with each labeled sample to select the most "diverse" ones for annotation, performs poorly because its computational complexity is $O(n^2)$. An efficiency-oriented active learning algorithm is needed.
Camouflage: Chinese characters have glyph and phonetic variations (Norman, 1988), e.g., "账 (account)" and "帐 (curtain)" have similar structure and pronunciation. Spammers can take advantage of this characteristic to evade detection algorithms (Jindal and Liu, 2007; Jiang et al., 2019a). It is important to propose a novel active learning model that can predict new Chinese character variation patterns that do not appear in the labeled dataset.
* These two authors contributed equally to this research. † Corresponding author.
To address these challenges, we propose a novel solution, the Semi-supervIsed GeNerative Active Learning (SIGNAL) model, which naturally integrates active learning and semi-supervised generative learning into a unified framework. SIGNAL is inspired by a simple yet powerful observation in the computer vision domain (Zhou et al., 2017): the patches generated from the same image share the same label, and are naturally expected to receive similar predictions from the classifier. Hence, the diversity of the predictions over the patches can measure the "power" of a candidate image in elevating the performance of the current classifier. Similarly, in this study, a set of semantically similar texts is automatically generated for each candidate sample through data augmentation. We hypothesize that the diversity of the predictions over the augmented texts is a useful indicator of how much a candidate text sample can boost the performance of the classifier. We define this strategy as a "self-diversity" based active learning strategy.
Algorithmically, unsupervised generative models, such as the variational autoencoder (Kingma and Welling, 2013), only learn to generate similar texts without considering the labeling information. Therefore, we utilize a Semi-supervised Variational AutoEncoder (S-VAE) (Kingma et al., 2014) to automatically generate semantically similar texts for each candidate sample while trying to keep label consistency. To enable S-VAE to perceive the sensitive positions of the candidate sample, we enrich the human annotation feedback. The annotator is required to provide not only a label for the candidate but also a rationale (critical terms in the candidate) (Sharma et al., 2015) for the chosen spam label. Based on the human-annotated rationales, we introduce a pseudo-mask distribution $P_m$ to guide the attention learning in S-VAE. A character variation graph-enhanced augmentation procedure is then applied to integrate the Chinese character variation knowledge and simulate the glyph and phonetic variation mutations in further data augmentation.
Compared with conventional active learning, SIGNAL offers three advantages: (1) SIGNAL is more sensitive to spam samples. (2) SIGNAL does not need to compare candidates with the labeled samples, which reduces its computational complexity to $O(N)$. (3) SIGNAL considers the heterogeneous variation knowledge of Chinese characters for spam detection.
The major contributions of this paper can be summarized as follows:
1. We propose the SIGNAL model, in the context of Chinese text spam detection, to address the imbalance, efficiency, and text camouflage problems. To the best of our knowledge, this is the first work to integrate active learning and semi-supervised generative learning for the text spam detection task.
2. The preliminary experiments on the Chinese SMS dataset demonstrate the efficacy and potential of SIGNAL for Chinese spam detection. A series of conventional active learning models can be improved after merging the SIGNAL model.
3. Although this study focuses on the Chinese spam detection task, SIGNAL theoretically has great potential to be applied to other NLP tasks. It can mitigate the data-hungry problem by cutting the labeling cost.

SIGNAL Model
Figure 1 depicts the proposed SIGNAL framework. It starts with a small set of labeled samples, a large set of unlabeled samples, and an initial classifier trained on the labeled samples. The goal of SIGNAL is to seek "salient" samples from the pool of unlabeled samples for annotation. The classifier can then be continuously improved by incrementally enlarging the training set with newly annotated samples. The pseudocode of SIGNAL is given in Algorithm 1.
Self-Diversity Based Active Learning. As mentioned above, SIGNAL uses a "self-diversity" criterion for active candidate selection.
Formally, for a candidate sample $x_i$, a set of augmented texts $AT_i = \{at_i^1, at_i^2, \cdots, at_i^j, \cdots, at_i^M\}$ is generated. The self-diversity $SD_i$ of $x_i$ (Eq. 1) measures how widely the current classifier's predictions on $AT_i$ spread around their mean, where $p_i^j$ is the prediction of the current classifier for augmented text $at_i^j$; $\bar{p}_i$ is the arithmetic mean of all predictions for $AT_i$; and $M$ is the total number of augmented texts. SD suggests the "worthiness" of a candidate for annotation. A large SD indicates that the current classifier's prediction for the target candidate is unstable: with a slight mutation, the prediction changes drastically. Such a candidate is worthy of annotation. This criterion has the potential to locate the vital samples and also to reduce the computational complexity. Furthermore, in the context of Chinese text spam detection, a spam candidate is more likely to obtain a larger SD. For instance, if the spam candidate mutates at the critical positions, the label of the augmented text is likely to change. In contrast, normal candidates are less likely to be affected in this way.
Figure 1: An Illustration of the "SIGNAL" Framework
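The self-diversity criterion can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes that the spread of the predictions is measured by their population standard deviation around the mean, which is one plausible reading of the definitions above. The helper names `self_diversity`, `select_top_k`, and `toy_predict` are hypothetical.

```python
from statistics import pstdev
from typing import Callable, Dict, List, Sequence


def self_diversity(augmented_texts: Sequence[str],
                   predict: Callable[[str], float]) -> float:
    """Spread of the classifier's predictions over one candidate's augmented texts.

    ASSUMPTION: the spread is the population standard deviation of the
    predictions p_i^j around their arithmetic mean; the exact form of Eq. 1
    may differ.
    """
    predictions = [predict(text) for text in augmented_texts]
    return pstdev(predictions)


def select_top_k(augmented: Dict[str, List[str]],
                 predict: Callable[[str], float],
                 k: int) -> List[str]:
    """Rank candidates by self-diversity and return the ids of the top k."""
    scores = {cid: self_diversity(texts, predict)
              for cid, texts in augmented.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def toy_predict(text: str) -> float:
    """Hypothetical stand-in classifier: spam probability keyed on one character."""
    return 0.9 if "账" in text else 0.1
```

Each candidate is scored only against its own augmented texts, so one selection round costs M classifier calls per unlabeled sample and never touches the labeled set, which is where the linear complexity in the pool size comes from.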
S-VAE with Masked Attention Learning. As shown in Figure 1, we utilize S-VAE with masked attention learning to generate similar texts at the semantic level. In this study, with annotated rationales $R$ (a set of critical terms), a pseudo-mask distribution $P_m$ is generated for each candidate sample. For the $i$-th term $t_i$ of the candidate sample, the pseudo-mask probability $Pr_i$ is calculated by Eq. 2, in which $I_R(t_i)$ is an indicator function that determines whether $t_i$ belongs to $R$; $\Delta$ is used for normalization; and $\rho$ is the weight that ensures the critical terms receive less attention, in other words, that they have a greater possibility of being "masked" during the generative process. Following (Kingma et al., 2014), the generative semi-supervised model with masked attention learning can be defined as:

$$Pr(y) = \mathrm{Cat}(y \mid \pi); \quad Pr(z) = \mathcal{N}(z \mid 0, I); \quad Pr_\omega(x' \mid f_r(x)) = f_a(x'; f_r(x), \omega); \quad Pr_\theta(x' \mid y, z) = f(x'; y, z, \theta)$$

where $x$ is a sample (labeled or unlabeled); $f_r(x)$ is a matrix generated by a non-linear transformation of $x$; $x'$ is a representation of $f_r(x)$ with an attention calculation, $x' = \sum_i \omega_i f_r(x)_i$; $\omega$ denotes the attention distribution, with $\omega_i = \mathrm{softmax}(f_c(f_r(x)))_i$, which is a scalar; $f_c$ is a single-dimensional non-linear transformation; $\mathrm{Cat}(y \mid \pi)$ is the multinomial distribution, and if $x$ is unlabeled, the class labels $y$ are treated as latent variables; $z$ is the latent variable; $\theta$ denotes the parameters of a non-linear transformation. Labeled samples can be used to train a classifier that predicts class labels $y$. During the inference process, we can predict the missing class for an unlabeled sample from the inferred posterior distribution $Pr_\theta(y \mid x')$.
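The pieces described above can be illustrated with a small, dependency-free Python sketch. It is not the authors' code. The form of the pseudo-mask probability is an assumption: rationale terms receive an unnormalized weight `rho` in (0, 1), all other terms receive weight 1, and the normalizer (called `delta` here) is the sum of the weights, so that critical terms end up with less attention mass. The function names are hypothetical, and `kl_divergence` only shows how an attention distribution could be compared against the pseudo-mask distribution.

```python
import math
from typing import List, Sequence, Set


def pseudo_mask(terms: Sequence[str], rationales: Set[str],
                rho: float = 0.2) -> List[float]:
    """ASSUMED form of the pseudo-mask distribution P_m (not the paper's Eq. 2).

    Rationale terms get weight rho < 1, other terms get weight 1, and the
    weights are normalized by their sum.
    """
    raw = [rho if term in rationales else 1.0 for term in terms]
    delta = sum(raw)  # normalization constant
    return [weight / delta for weight in raw]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_pool(rows: Sequence[Sequence[float]],
                   scores: Sequence[float]) -> List[float]:
    """x' = sum_i omega_i * f_r(x)_i, with omega = softmax(scores)."""
    omega = softmax(scores)
    dim = len(rows[0])
    return [sum(w * row[d] for w, row in zip(omega, rows)) for d in range(dim)]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) for two discrete distributions over the same positions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

For the four terms "办", "理", "账", "户" with "账" as the only rationale and `rho = 0.2`, this assumed form gives the rationale term probability 0.0625 and each other term 0.3125.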
The loss function of S-VAE with masked attention learning combines two terms: $L_{S\text{-}VAE}$, the loss of the original S-VAE (Kingma et al., 2014), and $D_{KL}(P_m \| P_{att})$, the KL divergence of the attention distribution $P_{att}$ from the pseudo-mask distribution $P_m$.
Character Variation Graph-enhanced Augmentation. In this study, a random-walk based graph-enhanced augmentation procedure is used to integrate the Chinese character variation knowledge and simulate the glyph and phonetic variation mutations. A Chinese character variation graph $G = (C, R)$ (Jiang et al., 2019a) is utilized, where $C$ denotes the Chinese character (vertex) set and $R$ denotes the variation relation (edge) set; each edge weight is the similarity of two characters under the given relation (variation) type. For critical positions in a piece of text, we adopt a random-walk based graph exploration to predict the possible Chinese character variation patterns. For more detailed information on this procedure, please refer to Algorithm 1.

Algorithm 1 Semi-supervised Generative Active Learning
Self-Diversity Based Active Learning (Labeled set: L, Unlabeled set: U = {x_1, ..., x_N}, Initial Classifier: C_t, t = 0, Chinese Character Variation Graph: G, Annotated Rationales: R)
    R = ∅
    repeat
        for all x_i ∈ U do
            With R, generate a pseudo-mask distribution P_m^i using Eq. 2
            SS_i = S-VAE(x_i, P_m^i)
            AT_i = GraphAugmentation(SS_i, G, P_m^i)
            With AT_i and C_t, calculate SD_i using Eq. 1
        end for
        Select top K unlabeled samples Q from U
        Get L' and R' from enriched human annotation
        L ← L ∪ L', R ← R ∪ R', U ← U \ Q
        t++, C_t ← Train(L, C_{t-1})
    until Convergence
    return C_t, L
GraphAugmentation (Similar text set: SS, Chinese Character Variation Graph: G, pseudo-mask distribution: P_m)
    AT = ∅
    for all ss_j ∈ SS do
        Probabilistically generate a position list POS with P_m
        for all pos_k ∈ POS do
            Get the character Ch_{pos_k} at position pos_k
            Ch_o ← Ch_{pos_k}
            Randomly generate a walking step T_p ∈ (0, T]
            Ch_n = RandomWalk(Ch_o, T_p, G)
            Ch_{pos_k} ← Ch_n
        end for
        Append ss_j to AT
    end for
    return AT
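The GraphAugmentation procedure of Algorithm 1 can be sketched as follows. This is a hedged illustration under several assumptions: the graph is a plain dictionary of weighted neighbours, the toy graph and its weights are made up around the "账"/"帐" pair, and the mutation positions are drawn with an independent per-position probability `mutate_prob`, because how the position list is derived from $P_m$ is not spelled out here. The names `random_walk`, `graph_augmentation`, and `TOY_GRAPH` are hypothetical.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical toy variation graph: character -> {neighbour: similarity weight}.
TOY_GRAPH: Dict[str, Dict[str, float]] = {
    "账": {"帐": 0.9},
    "帐": {"账": 0.9},
}


def random_walk(start: str, steps: int,
                graph: Dict[str, Dict[str, float]],
                rng: random.Random) -> str:
    """Walk `steps` edges from `start`, choosing neighbours by edge weight."""
    current = start
    for _ in range(steps):
        neighbours = graph.get(current)
        if not neighbours:
            break  # character has no known variation; stay where we are
        chars = list(neighbours)
        weights = [neighbours[ch] for ch in chars]
        current = rng.choices(chars, weights=weights, k=1)[0]
    return current


def graph_augmentation(similar_texts: Sequence[str],
                       graph: Dict[str, Dict[str, float]],
                       mutate_prob: Sequence[float],
                       max_steps: int = 2,
                       seed: int = 0) -> List[str]:
    """Mutate sampled positions of each text via a random walk on the graph.

    ASSUMPTION: position pos is mutated with probability mutate_prob[pos]
    (0 for positions beyond the end of mutate_prob).
    """
    rng = random.Random(seed)
    augmented = []
    for text in similar_texts:
        chars = list(text)
        for pos, ch in enumerate(chars):
            prob = mutate_prob[pos] if pos < len(mutate_prob) else 0.0
            if rng.random() < prob:
                steps = rng.randint(1, max_steps)  # walking step T_p in (0, T]
                chars[pos] = random_walk(ch, steps, graph, rng)
        augmented.append("".join(chars))
    return augmented
```

With `max_steps=1` and a mutation probability of 1.0 at the third position only, the text "办理账户" becomes "办理帐户", which mirrors the glyph and phonetic substitution discussed in the introduction.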

Preliminary Experiment
Dataset and Experiment Setting. A Chinese SMS dataset was used for the experiment. There were 48,896 testing samples, including 23,891 spam samples and 25,005 normal samples. The size of the active learning sample set was 48,884, including 23,891 spam samples and 24,993 normal samples. 200 samples were randomly selected as the initial labeled set. The remaining samples were used as an unlabeled sample pool. In each iteration, 100 samples were selected by each active learning model. The iterative active learning process was repeated 10 times. For evaluation, a single-layer CNN classifier was trained on the labeled samples. Uncertainty (Lewis and Gale, 1994), Margin (Roth and Small, 2006), and Entropy (Li and Guo, 2013) were chosen as baseline models. Similar baseline settings can be found in (Zhou et al., 2017; Huang et al., 2018; Yoo and Kweon, 2019).
Figure 2: ... (B) the classifier performance (accuracy) comparison between "Uncertainty" and "Uncertainty merging SIGNAL"; (C) the classifier performance (accuracy) comparison between "Entropy" and "Entropy merging SIGNAL"; (D) the classifier performance (accuracy) comparison between "Margin" and "Margin merging SIGNAL"
In the SIGNAL model, for S-VAE training, we chose "BiGRU + Attention + MLP" as the encoder structure, a "single-layer GRU" as the decoder structure, and a "single-layer CNN + MLP" as the classifier. For each candidate sample, 10 augmented texts were generated for the "self-diversity" calculation.
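For readers unfamiliar with the three baseline criteria, the following Python sketch gives their textbook formulations for a binary classifier that outputs a spam probability. It is an illustration under assumptions, not the exact baseline implementations used in the experiment, and it does not show how a baseline is merged with SIGNAL. The function names are hypothetical.

```python
import math
from typing import Callable, Dict, List


def least_confidence(p_spam: float) -> float:
    """Uncertainty sampling: 1 minus the probability of the predicted class."""
    return 1.0 - max(p_spam, 1.0 - p_spam)


def margin_score(p_spam: float) -> float:
    """Negated margin between the two class probabilities (higher = closer call)."""
    return -abs(p_spam - (1.0 - p_spam))


def entropy_score(p_spam: float) -> float:
    """Shannon entropy (natural log) of the predicted class distribution."""
    total = 0.0
    for p in (p_spam, 1.0 - p_spam):
        if p > 0.0:
            total -= p * math.log(p)
    return total


def select_batch(pool_probs: Dict[str, float],
                 score_fn: Callable[[float], float],
                 k: int = 100) -> List[str]:
    """Return the ids of the k pool samples with the highest acquisition score."""
    ranked = sorted(pool_probs, key=lambda sid: score_fn(pool_probs[sid]),
                    reverse=True)
    return ranked[:k]
```

Under the schedule described above (200 initial labels, 100 selections per iteration, 10 iterations), the labeled set grows from 200 to 1,200 samples. For a binary classifier the three scores are all monotone in the distance of the spam probability from 0.5, so in this simplified form they produce the same ranking.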
Sensitivity of Spam Sample Selection. As shown in Figure 2 (A), SIGNAL is more sensitive to spam samples than the baseline models: it selected significantly more spam samples than any of the baselines. This observation indicates the potential of SIGNAL for addressing the "imbalance" problem in Chinese text spam detection.
The Elevating "Power" of SIGNAL. As shown in Figure 2 (B), (C), and (D), after merging SIGNAL, all baseline models improved to varying degrees. In particular, for margin-based active learning (Roth and Small, 2006), SIGNAL improves the performance in all active learning iterations. On average, merging SIGNAL improves Margin by 10% in terms of accuracy, the metric reported in Figure 2.
Case Study. To gain a straightforward understanding of the generation quality of SIGNAL, we present two augmented texts in Figure 3. From these two cases, we make the following observations: (1) The augmented texts are semantically similar to the original sample. (2) Although the original sample contains no variation character, the augmented texts can simulate the phonetic or glyph variation mutations. (3) If the critical terms in the original sample are replaced, the label of the text can change.

Conclusion
In this paper, we propose the SIGNAL model for Chinese text spam detection. SIGNAL integrates active learning and semi-supervised generative learning into a unified framework. As an exploratory study of this newly proposed problem, the preliminary results reveal the potential of SIGNAL to address the critical problems in the proposed task. For instance, Figure 2 (A) shows that SIGNAL can be more sensitive to spam samples (Imbalance Challenge); the case study (Figure 3) shows the capacity of SIGNAL to simulate the phonetic or glyph variation mutations (Camouflage Challenge); and, compared with the classical diversity-based approach, integrating self-diversity based active learning with generative learning greatly reduces the computational complexity ($O(N^2) \to O(N)$, Efficiency Challenge).
In the future, we plan to enable glyph and phonetic variation detection by integrating variation graph representation learning, which may improve SIGNAL's performance.