Detect Camouflaged Spam Content via StoneSkipping: Graph and Text Joint Embedding for Chinese Character Variation Representation

The task of Chinese text spam detection is highly challenging due to both glyph and phonetic variations of Chinese characters. This paper proposes a novel framework that jointly models the variational, semantic, and contextualized representations of Chinese characters for the text spam detection task. In particular, a Variation Family-enhanced Graph Embedding (VFGE) algorithm is designed based on a Chinese character variation graph. VFGE can learn both the graph embeddings of the Chinese characters (local) and the latent variation families (global). Furthermore, an enhanced bidirectional language model, with a combination gate function and an aggregation learning function, is proposed to integrate the graph and text information while capturing the sequential information. Extensive experiments have been conducted on both SMS and review datasets, showing that the proposed method outperforms a series of state-of-the-art models for Chinese spam detection.


Introduction
Chinese orchestrates tens of thousands of characters by utilizing their morphological information, e.g., pictograms, simple/compound ideograms, and phono-semantic compounds (Norman, 1988). Different characters, however, may share a similar glyph or phonetic "root". For instance, from the glyph perspective, the character "裸 (naked)" looks like "课 (course)" (homographs), while from the phonetic viewpoint, it shares a similar pronunciation with "锣 (gong)" (homophones). The forms of variation can also be compounded: for instance, "账 (account)" and "帐 (curtain)" have a similar structure and pronunciation (homonyms). Unfortunately, in the context of spam detection, as shown in Figure 1, spammers are able to take advantage of these variations to escape detection algorithms (Jindal and Liu, 2007). For instance, in the e-commerce ecosystem, variation-based Chinese spam mutations thrive to spread illegal, misleading, and harmful information 1.

In this study, we propose a novel problem, Chinese Spam Variation Detection (CSVD): investigating an effective Chinese character embedding model that assists classification models in detecting variations of Chinese spam text. The problem poses the following key challenges. Diversity: the variation patterns of Chinese characters can be complex and subtle, and are difficult to generalize and detect. For instance, in our experimental dataset, one Chinese character has 297 (glyph and phonetic) variants on average and 2,332 at most. Existing keyword-based spam detection approaches, e.g., (Ntoulas et al., 2006), can hardly address this problem. Sparseness, Zero-shot, and Dynamics: when competing with classification models, spammers constantly create new combinations of Chinese characters for spam texts (which can be cast as a "zero/few-shot learning" problem (Socher et al., 2013)).

* These two authors contributed equally to this research. † Corresponding author.
The labelling cost is inevitably high in such a dynamic circumstance. Data-driven approaches will perform poorly on unseen data. Camouflage: with common cognitive knowledge of Chinese and the contextual information, users are able to consume the spam information even when some characters in the content are intentionally mutated into similar variations (Spinks et al., 2000; Shu and Anderson, 1997). For machines, however, variation-based spam texts are highly camouflaged. It is therefore important to propose a novel Chinese character representation learning model that can synthesize character variation knowledge, semantics, and contextualized information.
To address these challenges, we propose a novel solution, the StoneSkipping (SS) model, to enable Chinese variation representation learning via graph and text joint embedding. SS is able to learn Chinese character variation knowledge and predict new variations that do not appear in the training set by utilizing a sophisticated heterogeneous graph mining method. For a piece of text (a character sequence), with the proposed model, each candidate character probes the character variation graph (like a stone bouncing across the water surface) and explores its glyph and phonetic variation information (like the ripples caused by the stone hitting the water). Algorithmically, a Variation Family-enhanced Graph Embedding (VFGE) algorithm is proposed to extract the heterogeneous Chinese variation knowledge while learning the (local) graph representation of a Chinese character along with the (global) representations of the latent variation families. Finally, an enhanced bidirectional language model, with a combination gate function and an aggregation learning function, is proposed to comprehensively learn the variation, semantic, and sequential information of Chinese characters. To the best of our knowledge, this is the first work to use graph embedding to learn the heterogeneous variation knowledge of Chinese characters for spam detection.
The major contributions of this paper can be summarized as follows: 1. We propose an innovative CSVD problem, in the context of text spam detection, to address the diversity, sparseness, and text camouflage problems.
2. A novel joint embedding SS model is proposed to learn the variational, semantic, and contextual representations of Chinese characters. SS is able to predict unseen variations.
3. A Chinese character variation graph is constructed for encapsulating the glyph and phonetic relationships among Chinese characters. Since the graph can be potentially useful for other NLP tasks, we share the graph/embeddings to motivate further investigation.
4. Through the extensive experiments on both SMS and review datasets 2 , we demonstrate the efficacy of the proposed method for Chinese spam detection. The proposed method outperforms the state-of-the-art models.

Related Work
Neural Word Embeddings. Unlike traditional word representations, low-dimensional distributed word representations (Mikolov et al., 2013; Pennington et al., 2014) are able to capture in-depth semantics of text content. More recently, ELMo (Peters et al., 2018) learned functions of the internal states of a deep bidirectional language model to generate character embeddings. BERT (Devlin et al., 2018) utilized bidirectional encoder representations from transformers (Vaswani et al., 2017) and achieved improvements on multiple NLP tasks. However, all the prior models focus only on learning the context, whereas text variation is ignored. Moreover, the CSVD problem differs from other NLP tasks: intentional character mutations and unseen variations (zero-shot learning (Socher et al., 2013)) can threaten the performance of prior models.
Chinese Word and Sub-word Embeddings. A number of studies explored Chinese representation learning methodologies. CWE (Chen et al., 2015) learned character and word embeddings jointly to improve representation performance. GWE (Su and Lee, 2017) introduced features extracted from images of traditional Chinese characters. JWE (Yu et al., 2017) used deep learning to generate character embeddings based on an extended radical collection. Cw2vec (Cao et al., 2018) investigated representing a Chinese character as a sequence of n-gram strokes to generate its embedding. Although these models consider the nature of Chinese characters, they only utilize glyph features, while phonetic information is ignored. In the CSVD problem, the forms of variation can be heterogeneous, and a single kind of feature cannot cover all mutation patterns. More importantly, none of these models is designed for spam detection; a task-oriented model should be able to highlight the most important variations in spam text.
Graph Embedding. A graph (a.k.a. information network) is a natural data structure for characterizing the multiple relationships between objects. Recently, multiple graph embedding algorithms have been proposed to learn low-dimensional feature representations of vertexes in graphs. DeepWalk (Perozzi et al., 2014) and Node2vec (Grover and Leskovec, 2016) are random-walk based models. LINE (Tang et al., 2015) models first- and second-order graph neighbourhoods. Meanwhile, metapath2vec++ (Dong et al., 2017) was designed for heterogeneous graph embedding with human-defined metapath rules. HEER (Shi et al., 2018) is a recent state-of-the-art heterogeneous graph embedding model. Though the techniques utilized in these models differ, most existing graph embedding models focus on local graph structure representation, e.g., modelling a fixed-size graph neighbourhood. The CSVD problem requires graph embedding conducted from a more global perspective, to characterize comprehensive variation patterns.
Spelling Correction. Spelling correction may serve as an alternative for the CSVD problem, e.g., using a dictionary-based (Yeh et al., 2014) or language-model-based (Yu and Li, 2014) method to restore content variations to their regular forms. However, because spammers intentionally mutate spam text to escape the detection model, training data sparseness and dynamics challenge this approach.


StoneSkipping Model

Figure 2 depicts the proposed SS model. There are three core modules in SS: a Chinese character variation graph that hosts the heterogeneous variation information; a variation family-enhanced graph embedding for Chinese character variation knowledge extraction and graph representation learning; and an enhanced bidirectional language model for joint representation learning. In the remainder of this section, we introduce each of them in detail.

Chinese Character Variation Graph
A Chinese character variation graph 3 can be denoted as G = (C, R). C denotes the Chinese character set, and each character is represented as a vertex in G. R denotes the variation relation (edge) set, and each edge weight is the similarity of two characters given the target relation (variation) type. To accurately characterize both the phonetic and glyph information of Chinese characters, we utilize three different encoding methods.

Pinyin provides phonetic-based information and is widely used for representing the pronunciations of Chinese characters (Chen and Lee, 2000). In this system, each Chinese character has one syllable, which consists of three components: an initial (consonant), a final (vowel), and a tone. There are four types of tones in Modern Standard Mandarin Chinese, and different tones with the same syllable can carry different meanings. For instance, the pinyin code of "裸 (naked)" is "luo3" and that of "锣 (gong)" is "luo2". The pinyin-based variation similarity is calculated over the pinyin syllables with tones 4.
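The paper does not give the exact pinyin similarity formula; the following is a minimal sketch of one plausible scoring, assuming (our assumption, not the paper's) a score of 1 for an identical syllable with the same tone, a reduced score for the same syllable with a different tone, and 0 for different syllables.

```python
def pinyin_similarity(p1: str, p2: str) -> float:
    """Toy similarity between two tone-annotated pinyin codes, e.g. 'luo3'.

    The paper only states that similarity is computed over syllables
    with tones; the 0.8 partial-credit weight below is illustrative.
    """
    # split each code into syllable and trailing tone digit
    s1, t1 = p1.rstrip("012345"), p1[len(p1.rstrip("012345")):]
    s2, t2 = p2.rstrip("012345"), p2[len(p2.rstrip("012345")):]
    if s1 != s2:
        return 0.0
    # same syllable: full score when tones also match, partial otherwise
    return 1.0 if t1 == t2 else 0.8
```

Under this scheme, "裸" ("luo3") and "锣" ("luo2") are highly similar despite their different tones, while phonetically unrelated characters score zero.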
Stroke is a basic glyph pattern for writing Chinese characters (Cao et al., 2018). All Chinese characters are written in a certain stroke order and can be represented as a stroke code; e.g., the stroke code of "裸 (naked)" is "4523425111234" and that of "课 (course)" is "4525111234". The stroke-based variation similarity is calculated with the longest-common-substring and longest-common-subsequence metrics 4.
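The two string metrics named above can be sketched as follows; how they are combined into a single similarity (here, an average normalized by the longer code length) is our assumption, not the paper's exact formula.

```python
from difflib import SequenceMatcher

def lc_substring(a: str, b: str) -> int:
    # length of the longest common contiguous substring
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return m.size

def lc_subsequence(a: str, b: str) -> int:
    # classic dynamic program for longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def stroke_similarity(a: str, b: str) -> float:
    # illustrative combination: average of the two lengths, normalized
    n = max(len(a), len(b))
    return 0.5 * (lc_substring(a, b) + lc_subsequence(a, b)) / n
```

For the stroke codes of "裸" and "课" above, the longest common substring is "25111234" (length 8) and the full code of "课" is a subsequence of that of "裸" (length 10), giving a high similarity.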
Zhengma is another important means of glyph character encoding, which encodes a character at the radical level (Yu et al., 2017). For instance, the Zhengma code of "裸 (naked)" is "WTKF" and that of "课 (course)" is "SKF". The Zhengma-based variation similarity is calculated with the Jaccard index metric 4.
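The Jaccard index over two Zhengma codes can be sketched by treating each code as a set of radical-level letters (a sketch; the paper may compare codes at a different granularity).

```python
def zhengma_jaccard(a: str, b: str) -> float:
    """Jaccard index of the two radical-letter sets of two Zhengma codes."""
    sa, sb = set(a), set(b)
    # |intersection| / |union|; identical empty codes count as identical
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0
```

For "裸" ("WTKF") versus "课" ("SKF") this gives |{K, F}| / |{W, T, K, F, S}| = 0.4.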
Unlike previous works (Cao et al., 2018; Yu et al., 2017) that employ only one kind of glyph-based information, we utilize two different glyph patterns (stroke and Zhengma) to encode a Chinese character, because these two patterns characterize Chinese characters at different internal structural levels and complement each other, enabling an enhanced glyph representation. Furthermore, the pinyin encoder provides phonetic information. The constructed character variation graph integrates these three kinds of variation relations, which can be significant for camouflaged spam detection.

Figure 2: An Illustration of the "StoneSkipping" Framework

Variation Family-enhanced Graph Embedding
While the variation graph provides comprehensive knowledge of Chinese character variations, two problems need to be addressed: (1) the variation patterns can be very flexible, and compounded (long-range) variation information transfer may exist. Therefore, short-range (local) graph information, e.g., a character vertex's neighbors, may be insufficient for spam detection. Meanwhile, it is impractical to exhaust all possible variation patterns.
(2) To keep the text consumable by users, spammers cannot make the variation patterns too complex or confusing; they usually focus on the most sensitive words in a spam message. Hence, random infrequent variation patterns can be "noise" for CSVD and pollute the detection outcomes.
Latent Character Variation Family. In this study, we propose the VFGE model to address these problems. As depicted in Figure 2, in the VFGE model we introduce a set of latent variables, "character variation families" F = {F_1, ..., F_|F|}, at the graph schema (global) level to capture the critical information for spam detection. Each F_i is defined as a distribution over characters, which aims to estimate the globally frequent variation dependencies in G. By learning F, VFGE is able to highlight useful variations, eliminate noisy patterns, and predict unseen variation forms w.r.t. the spam detection task.
Random Walk based Character-Family Representation Co-Learning. VFGE is a random walk based graph embedding model; we employ a hierarchical random walk strategy (Jiang et al., 2018b) on G to generate optimized walking paths (character vertex sequences) for each character. The model samples the most probable variation context vertexes for each character. Based on the generated walking paths, VFGE executes the following two processes iteratively:

(1) Family Assignment. By leveraging both local context and global family distributions, we assign a discrete family to each character vertex in a particular walking path to form a character-family pair ⟨C, F⟩. As shown in Figure 2, we assume different walking paths tend to exhibit various character variation patterns, which can be represented as mixtures over latent variation families. Given a character C_i in a path, C_i has a higher chance of being assigned to a dominant family F_i. The assignment probability can be calculated as:

Pr(F_i | C_i, path) ∝ Pr(C_i | F_i) · Pr(F_i | path)    (1)

where, as depicted in Figure 2, α is the parameter of the Dirichlet prior on the per-path family distributions; β is the family assignment distribution (Pr(C|F)); and θ is the family mixture distribution for a walking path (Pr(F|path)). The distribution learning can be considered a Bayesian inference problem, and we use Gibbs sampling (Porteous et al., 2008) to solve it.
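A toy illustration of one family-assignment draw, assuming the assignment probability factorizes LDA-style as Pr(F | C, path) ∝ Pr(C | F) · Pr(F | path), consistent with the β and θ distributions described in the text; a full Gibbs sampler would also resample the counts behind β and θ after each assignment.

```python
import numpy as np

def assign_family(char_idx, path_idx, beta, theta, rng):
    """Sample a variation family for one character occurrence in a walk.

    beta[f, c]  plays the role of Pr(C = c | F = f)
    theta[p, f] plays the role of Pr(F = f | path = p)
    """
    p = beta[:, char_idx] * theta[path_idx, :]  # unnormalized Pr(F | C, path)
    p = p / p.sum()
    return rng.choice(len(p), p=p), p

# toy setup: 2 families, 2 characters, 1 walking path
beta = np.array([[0.9, 0.1],    # family 0 strongly prefers character 0
                 [0.2, 0.8]])   # family 1 strongly prefers character 1
theta = np.array([[0.5, 0.5]])  # the path is an even mixture of both families
f, probs = assign_family(0, 0, beta, theta, np.random.default_rng(0))
```

With this setup, character 0 in the path leans toward family 0, matching the intuition that a character is assigned to its dominant family.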
(2) Character-Family Representation Co-Learning. Given the assigned character-family pairs, the proposed method aims to obtain the representations of character C and latent variation family F by mapping them into a low-dimensional space R^d (d is a parameter specifying the number of dimensions). Motivated by (Liu et al., 2015), we propose a novel representation learning method to optimize characters and families separately and simultaneously.
The objective is defined to maximize the following log probability:

Σ_{C_i ∈ C} Σ_{⟨C_j, F_j⟩ ∈ N(C_i)} log Pr(⟨C_j, F_j⟩ | C_i^{F_i})    (2)

We use f(·) as the embedding function: C_i = f(C_i) denotes the character graph embedding and F_i = f(F_i) the family graph embedding; C_i^{F_i} denotes the concatenation of C_i and F_i, and N(C_i) is C_i's neighborhood (context). As Figure 2 shows, the feature representation learning method is an upgraded version of the skip-gram architecture. Compared with merely using the target vertex C_i to predict context vertexes, as in the original skip-gram model (Mikolov et al., 2013), the proposed approach employs the character-family pair ⟨C_i, F_i⟩ to predict the context character-family pairs. From the variation viewpoint, a character vertex's context thus encapsulates both local (vertex) and global (variation family) information. Hence, the learned representations comprehensively preserve the variation information in G.
Pr(⟨C_j, F_j⟩ | C_i^{F_i}) is modeled as a softmax function:

Pr(⟨C_j, F_j⟩ | C_i^{F_i}) = exp(C_j^{F_j} · C_i^{F_i}) / Σ_{⟨C, F⟩} exp(C^F · C_i^{F_i})    (3)

Stochastic gradient ascent is used to optimize the model parameters of f, and negative sampling (Mikolov et al., 2013) is applied for optimization efficiency. Note that the parameters of each character embedding and family embedding are shared over all the character-family pairs, which, as suggested in (Liu et al., 2015), can address the training data sparseness problem and improve the representation quality.
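A toy version of this pair-level softmax, with random stand-in vectors in place of trained embeddings; the real model never enumerates the full pair vocabulary but approximates the denominator with negative sampling.

```python
import numpy as np

rng = np.random.default_rng(7)
d = 4                            # toy per-embedding dimension
C = rng.normal(size=(5, d))      # character graph embeddings f(C)
F = rng.normal(size=(3, d))      # family graph embeddings f(F)

def pair_vec(c, f):
    # C^F: concatenation of a character embedding and a family embedding
    return np.concatenate([C[c], F[f]])

def pair_softmax(ci, fi, vocab_pairs):
    """Pr(<Cj, Fj> | Ci^{Fi}) for every pair in an enumerated vocabulary."""
    center = pair_vec(ci, fi)
    scores = np.array([pair_vec(c, f) @ center for c, f in vocab_pairs])
    e = np.exp(scores - scores.max())   # max-shift for numerical stability
    return e / e.sum()

vocab = [(c, f) for c in range(5) for f in range(3)]
probs = pair_softmax(0, 0, vocab)
```

The distribution sums to one over all character-family pairs, mirroring the softmax denominator in the objective.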
Family-enhanced Embedding Integration. As shown in Figure 2, the family-enhanced character graph embedding can be calculated as:

G_i = [ C_i ; Σ_{F_j ∈ F} Pr(F_j | C_i) · F_j ]    (4)

where G_i is the family-enhanced graph embedding for C_i and [· ; ·] is the concatenation operation. Pr(F_j | C_i) can be inferred from the family assignment distribution β.
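Reading the integration as concatenating C_i with the family expectation Σ_j Pr(F_j | C_i) · F_j (an assumed reading of the elided formula, consistent with the stated concatenation operation), the computation is:

```python
import numpy as np

def family_enhanced(C_i, F_embs, pr_f_given_c):
    """G_i = [C_i ; sum_j Pr(F_j|C_i) * F_j] -- an assumed reconstruction."""
    family_mix = pr_f_given_c[:, None] * F_embs   # weight each family vector
    return np.concatenate([C_i, family_mix.sum(axis=0)])

C_i = np.array([1.0, 2.0, 3.0])           # character graph embedding
F_embs = np.array([[0.0, 0.0, 2.0],       # two family embeddings
                   [2.0, 0.0, 0.0]])
G_i = family_enhanced(C_i, F_embs, np.array([0.5, 0.5]))
```

With an even family posterior, the second half of G_i is simply the mean of the two family vectors.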

Enhanced Bidirectional Language Model
As shown in Figure 2, the SS model utilizes an enhanced bidirectional language model to jointly learn the variation, semantic, and contextualized representations of Chinese characters.

Combination Gate Function. This gate function combines the variation and semantic representations and serves as the input function of the bidirectional language model. It is formulated as:

P = σ([G ; T] · W_P)
N = P ⊙ G + (1 − P) ⊙ T

where P ∈ R^d holds the preference weights controlling the contributions from G ∈ R^d (variation graph embedding) and T ∈ R^d (skip-gram textual embedding), W_P ∈ R^{2d×d}, and N ∈ R^d is the combined representation; ⊙ denotes the element-wise product and + the element-wise sum.

Aggregation Learning Function. With the combined representation N as input, we train a bidirectional language model to capture the sequential information. There can be multiple layers of forward and backward LSTMs in the bidirectional language model. For the k-th character, →H_k^l is the output of the forward LSTM unit at layer l, where l = 1, 2, ..., L, and ←H_k^l is the output of the backward LSTM unit.
The output SS embedding is learned via an aggregation function, which aggregates the intermediate layer representations of the bidirectional language model and the input embedding N. For the k-th character, if we denote H_k^0 = N_k and H_k^l = [→H_k^l ; ←H_k^l] for l = 1, ..., L, the output can be written as:

SS_k = ω · Σ_{l=0}^{L} s_l · H_k^l

where ω is a scale parameter and s_l is a weight parameter for the combination of each layer, learned during training. A similar aggregation operation has been proven useful for modelling contextualized word representations (Peters et al., 2018).
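A minimal numpy sketch of the gate and the aggregation, under two assumptions not spelled out in the text: the gate is parameterized as P = σ([G;T]·W_P), which matches the stated shapes (W_P ∈ R^{2d×d}; P, G, T, N ∈ R^d), and the layer weights s_l are softmax-normalized as in ELMo, with H^0 taken to be the gate output N.

```python
import numpy as np

def combination_gate(G, T, W_P):
    """N = P * G + (1 - P) * T, with gate P = sigmoid([G;T] @ W_P)."""
    P = 1.0 / (1.0 + np.exp(-(np.concatenate([G, T]) @ W_P)))
    return P * G + (1.0 - P) * T

def aggregate(H_layers, s, omega):
    """SS_k = omega * sum_l softmax(s)_l * H_k^l (ELMo-style aggregation)."""
    w = np.exp(s - np.max(s))
    w = w / w.sum()                     # softmax-normalized layer weights
    return omega * np.sum(w[:, None] * np.asarray(H_layers), axis=0)

d = 2
G = np.array([1.0, 3.0])                # variation graph embedding
T = np.array([3.0, 1.0])                # skip-gram textual embedding
N = combination_gate(G, T, np.zeros((2 * d, d)))   # zero weights -> P = 0.5
H1 = np.array([4.0, 4.0])               # stand-in for one biLSTM layer output
ss_k = aggregate([N, H1], np.array([0.0, 0.0]), omega=1.0)
```

With zero gate weights the sigmoid yields 0.5 everywhere, so N is an even blend of G and T, and uniform layer weights make SS_k the average of N and the biLSTM layer.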

Dataset and Experiment Setting
Dataset 5. In Table 2, we summarize the statistics of the two real-world Chinese spam datasets: an SMS dataset and a product review dataset. Both datasets were manually labeled (spam or regular) by professionals. False advertising and scam information are the most common forms of spam in the SMS dataset, while abusive content dominates the review spam dataset. The constructed variation graph contains 25,949 Chinese characters (vertexes) and 7,705,051 variation relations in total: 1,508,768 pinyin relations (phonetic), 373,803 stroke relations (glyph), and 5,822,480 Zhengma relations (glyph).
Experimental Set-up. We validated the proposed model on the Chinese text spam detection task. In order to simulate the "diversity", "sparseness", and "zero-shot" problems of real business scenarios, we imposed a challenging restriction on the training and testing sets: character variations were included only in the testing set, while all samples in the training set used the original characters.

Table 3: Case Study: given the target character, we list the top 3 similar characters from each algorithm. The characters are selected from a frequently used candidate character set of size 8,238. (G: Glyph; P: Phonetic; S: Semantic; C: Context)
For the proposed SS model, we used the following settings: number of LSTM layers: 2; dimension of the hidden (output) state in the LSTM: 128; dimension of the pre-trained character text embedding: 128; dimension of the VFGE embedding: 128; batch size: 64; dropout: 0.1. For training the VFGE embedding 6, the walk length was 80 and the number of walks per vertex was 10. These parameter settings follow (Peters et al., 2018; Jiang et al., 2018a; Perozzi et al., 2014; Grover and Leskovec, 2016). The variation family number 7 was 500. The SS model was pretrained for parameter initialization, as suggested in (Peters et al., 2018).
Baselines and Comparison Groups. We chose 13 strong baseline algorithms, from text or graph viewpoints, to comprehensively evaluate the performance of the proposed method.
Comparison Group: we compared the performances of several variants of the proposed method in order to highlight our technical contributions. Three variants were evaluated. SS Graph: using only the VFGE graph embedding. SS Naive: simply concatenating the VFGE graph embedding and the skip-gram textual embedding (a naive version). SS Original: the full proposed SS model.
For a fair comparison, the dimension 9 of all embedding models was 128. A single layer of CNN classification model 10 was used for spam detection task.

Experiment Result and Analysis
The text spam detection performances of the different models are reported in Table 1. Based on the experimental results, we make the following observations:
(1) SS Original outperformed the baseline models on all evaluation metrics on both datasets, which indicates that the proposed SS model can effectively address the CSVD problem.
(2) On review dataset, the leading gap between SS Original and other baselines was greater. A possible explanation was that, the review spam text usually had richer content and more complex variation patterns than SMS spam text. Therefore, a good variation representation model may have certain advantages.
Chinese vs. General.
(1) Compared to the classical textual embedding models (Skip-gram and GloVe), the Chinese embedding models showed their advantages, especially on the review dataset. This result indicates that characteristic knowledge of Chinese can help detect spam text. (2) ELMo was able to learn both semantic and contextualized information, and it achieved a good performance within the text baseline group.
Graph vs. Text. Generally, the graph based baselines outperformed the textual based baselines (including general and Chinese). This observation indicated: (1) the variation knowledge of Chinese character can be critical for CSVD problem.
(2) The proposed character variation graph can provide critical information for Chinese character representation learning. (3) Compared to other graph based baselines, SS Graph was superior, which proved the effectiveness of VFGE algorithm, and the proposed variation family can characterize and predict useful variation patterns for CSVD problem.
Chinese Character Encodings.
(1) In the Chinese textual embedding baseline group, JWE (radical based) and Cw2vec (stroke based) did not perform well, which indicates that employing a single kind of glyph-based information can be insufficient for Chinese variation representation learning. Similarly, in the graph-based baseline group, the performances of M2V P, M2V S, and M2V Z (each employing only one encoding relation on the constructed graph) were still unsatisfactory. These results reveal that an individual encoding method cannot comprehensively encode a character; various kinds of variation information should be considered simultaneously.
(2) The performance of M2V C (which integrated all relations based on a predefined metapath pattern) was still inferior. This result indicates that a human-defined rule cannot effectively integrate all relationships in a complex graph.
Representation vs. Spelling Correction. Pycorrector performed poorly in the experiments and was outperformed by the other baselines, which shows that the spelling correction approach is not capable of addressing the CSVD problem.
Variants of the SS model. For the variants of the proposed method, the results show that (1) combining the semantic and sequential information improves task performance; (2) simply concatenating graph and text embeddings cannot generate a satisfactory joint representation; and (3) the proposed SS model successfully captures the variation, semantic, and sequential information for character representation learning.

Case Study
To gain an insightful understanding of the variation representation of the proposed method, we conduct a qualitative analysis by performing case studies of character similarities. As shown in Table 3, for the exemplary characters, the most similar characters based on the skip-gram embedding (general textual baseline) are all semantically similar or/and context-related. Likewise, based on the Cw2vec embedding (the most recent Chinese embedding baseline), all similar characters for the target characters are also semantically similar or/and context-related. Unsurprisingly, for each target character, all similar characters based on the VFGE model (the best performing graph embedding model) are glyph- and phonetically-similar characters. The proposed SS model achieves a comprehensive coverage from the variation, semantic, and context viewpoints. For instance, among its top 3 similar characters for "运 (move)", "转 (transmit)" is a semantically and contextually similar character, and "云 (cloud)" is a glyph- and phonetically-similar character. Furthermore, the SS model can capture complicated compound similarities between Chinese characters; for instance, "悚 (afraid)" is a glyph, semantic, and context similar character for "惊 (shock)". This also explains why the SS model performs well on the CSVD problem. Spammers may, for instance, use the glyph variation "江 (river)" to replace "红 (red)", and the glyph-phonetic compound variation "薇 (osmund)" to replace "微 (micro)". Classical text embedding models may fail to identify this kind of spam text, whereas, by mining the character variation graph, the graph-based approaches can capture these changes. For spam text without variations, classification models need more semantic and contextual information, and the text-based methods are suitable for this kind of spam text.
The proposed SS model is able to detect both kinds of spam texts effectively, and the experimental results proved that SS can successfully model Chinese variational, semantic, and contextualized representations for the CSVD task.

Conclusion
In this paper, we propose the StoneSkipping model for Chinese spam detection. The performance of the proposed method is comprehensively evaluated on two real-world datasets with a challenging experimental setting. The experimental results show that the proposed model significantly outperforms a number of state-of-the-art methods. Meanwhile, the case study empirically demonstrates that the proposed model can successfully capture Chinese variation, semantic, and contextualized information, which is essential for the CSVD problem. In the future, we will investigate more sophisticated methods to improve SS's performance, e.g., enabling a self-attention mechanism for contextualized information modelling.