An Entity Resolution Approach to Isolate Instances of Human Trafficking Online

Human trafficking is a challenging law enforcement problem, and traces of victims of such activity manifest as ‘escort advertisements’ on various online forums. Given the large, heterogeneous and noisy structure of this data, building models to predict instances of trafficking is a convoluted task. In this paper we propose an entity resolution pipeline using a notion of proxy labels, in order to extract clusters from this data with prior history of human trafficking activity. We apply this pipeline to 5M records from backpage.com and report on the performance of this approach, challenges in terms of scalability, and some significant domain specific characteristics of our resolved entities.


Introduction
Over the years, human trafficking has grown into a challenging law enforcement problem. The advent of the internet has brought the problem into the public domain, making it an ever greater societal concern. Prior studies (Kennedy, 2012) have applied computational techniques to online escort advertisement data to detect spatio-temporal patterns, utilizing certain features of the ads. Other studies (Dubrawski et al., 2015) have used machine learning approaches to identify whether ads could be involved in human trafficking activity. Significant work has also been carried out in building large distributed systems to store and process such data, and to carry out entity resolution to establish ontological relationships between various entities (Szekely et al., 2015). In this paper we explore the possibility of leveraging this information to identify the sources of these advertisements, isolate such clusters, and identify potential sources of human trafficking from this data using prior domain knowledge.
In ordinary Entity Resolution schemes, each record is considered to represent a single entity. A popular approach in such scenarios is a 'merge and purge' strategy where, as records are compared and matched, they are merged into a single more informative record and the individual records are deleted from the dataset (Benjelloun et al., 2009). While our problem can be considered a case of Entity Resolution, escort advertisements are a challenging, noisy and unstructured dataset. A single advertisement may represent one entity or a group of entities, and hence might contain features belonging to more than one individual or group.
The advertisements are also associated with multiple features, including text, hyperlinks, images, timestamps and locations. To featurize characteristics from the text, we use a regex-based information extractor built on the GATE framework (Cunningham, 2002). This allows us to generate certain domain-specific features from our dataset, including the aliases, cost, location, phone numbers and specific URLs of the entities advertised. We use these, along with other generic text and the images, as features for our classifier. The high reuse of similar features makes it difficult to use exact match over a single feature in order to perform entity resolution.
We proceed to leverage machine learning approaches to learn a function that can predict whether two advertisements are from the same source. The challenge is that we have no prior knowledge of the source of the advertisements. We thus depend upon a strong feature, in our case phone numbers, which can be used as proxy evidence for the source of the advertisements and can help us generate labels for the training and test data for a classifier. We can therefore use such strong evidence to learn another function, which can help us generate labels for our dataset; this semi-supervised approach is described as 'surrogate learning' in (Veeramachaneni and Kondadadi, 2009). Pairwise comparison results in an extremely high number of comparisons over the entire dataset. In order to reduce this, we use a blocking scheme using certain features.
The resulting clusters are isolated for human trafficking using prior expert knowledge and featurized. Rule learning is used to establish differences between these and other components. The entire pipeline is represented in Figure 2.

Domain and Feature Extraction
Figure 1 illustrates the search results for escort advertisements and a page advertising a particular individual. The text is inundated with special characters, emojis, and misspelled words that are specific markers and highly informative to domain experts. The text contains information regarding the escort's area of operation, phone number, any particular client preferences, and the advertised cost. We build regular-expression-based feature extractors to extract this information and store it in a fixed schema, using the popular JAPE tool, part of the GATE suite of NLP tools. The extractor we build for this domain, AnonymousExtractor, is open source and publicly available at github.com/mille856/CMU_memex.
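To give a flavour of regex-based extraction, the following is a minimal Python sketch for one field, the phone number. It is not the JAPE grammar used by our extractor: the pattern, the function name `extract_phone_numbers`, and the restriction to plain ten-digit numbers are assumptions made for illustration, and the sketch does not handle the intentional obfuscations (digits spelled out as words, interleaved symbols) that are common in this domain.

```python
import re

# Hypothetical pattern: ten digits, optionally grouped with parentheses,
# spaces, dots or dashes. Real ads are frequently obfuscated beyond this.
PHONE_PATTERN = re.compile(
    r"(?<!\d)\(?\d{3}\)?[\s.\-]*\d{3}[\s.\-]*\d{4}(?!\d)"
)


def extract_phone_numbers(text):
    """Return the distinct phone numbers in `text` as digit-only strings."""
    numbers = []
    for match in PHONE_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())  # strip separators
        if digits not in numbers:                  # keep first occurrence only
            numbers.append(digits)
    return numbers
```

Normalising every match to a digit-only string means that differently formatted copies of the same number collapse to one value, which matters when the number is later used as a strong feature.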
Table 1 lists the performance of our extraction tool on 1,000 randomly sampled escort advertisements. (See (Dubrawski et al., 2015) for a complete description of the fields extracted.) The noisy nature of the text, along with intentional obfuscation, especially in the case of features like names, results in lower performance on those features than on the other extracted features.
Apart from the regular-expression-based features, we also extract the hashcodes of the images in the advertisements, the posting date and time, and the location. We approach the problem of extracting connected components from our dataset using pairwise entity resolution. The similarity or connection between two nodes is treated as a learning problem, with training data generated by using 'proxy' labels from existing evidence of connectivity from strong features.
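The idea of proxy labels from a strong feature can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not our production code: the function names and the input format (a dictionary from ad id to its set of extracted phone numbers) are hypothetical. Ads that share a phone number are linked, and the connected components of the resulting graph are found with a union-find structure.

```python
from collections import defaultdict


def strong_components(ad_phones):
    """Group ad ids into components, linking ads that share a phone number.

    `ad_phones` maps an ad id to the set of phone numbers extracted from it
    (assumed input format). Ads without any phone number stay as singletons.
    """
    parent = {ad: ad for ad in ad_phones}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_ad_for_phone = {}
    for ad, phones in ad_phones.items():
        for phone in phones:
            if phone in first_ad_for_phone:
                root_a, root_b = find(first_ad_for_phone[phone]), find(ad)
                if root_a != root_b:
                    parent[root_b] = root_a  # merge the two components
            else:
                first_ad_for_phone[phone] = ad

    components = defaultdict(set)
    for ad in ad_phones:
        components[find(ad)].add(ad)
    return list(components.values())


def proxy_label(ad_a, ad_b, components):
    """Return 1 if both ads lie in the same strong-feature component, else 0."""
    index = {ad: i for i, comp in enumerate(components) for ad in comp}
    return int(index[ad_a] == index[ad_b])
```

Note that two ads can receive a positive proxy label without sharing a number directly, as long as a chain of shared numbers connects them. How labeled pairs are then sampled for training is a separate question that this sketch does not address.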
More formally, the problem can be considered to be that of sampling all connected components $H_i(V, E)$ from a graph $G(V, E)$. Here $V = \{v_1, v_2, \ldots, v_n\}$, the set of vertices, is the set of advertisements, and $E = \{(v_i, v_j), (v_j, v_k), \ldots, (v_k, v_l)\}$ is the set of edges between individual records, the presence of which indicates that they represent the same entity.
We need to learn a match function that, for a pair of advertisements $(v_i, v_j)$, predicts whether the edge $(v_i, v_j)$ is present in $E$. The set of strong features present in a given record can be considered to be the function $S$. Thus, in our problem, $S_v$ represents all the phone numbers associated with $v$.
Thus $S = \bigcup S_{v_i}, \forall v_i \in V$. Here, $|S| \ll |V|$. Now, let us further consider the graph $G^*(V, E)$ defined on the set of vertices $V$, such that two vertices are connected whenever they share a strong feature (more simply, the graph described by strong features). Let $H^*$ be the set of all the connected components of $G^*$. Now, function $P$ is such that for any

Sampling Scheme
For our classifier we need to generate a set of training examples $T$, where $T_{pos}$ and $T_{neg}$ are the subsets of samples labeled positive and negative.
In order to ensure that the sampling scheme does not end up sampling near-duplicate pairs, we introduce a sampling bias over the feature vectors. This reduces the likelihood of sampling near duplicates, as evidenced in Figure 4, which is a histogram of the Jaccard similarity between the sets of unigrams of the text contained in each pair of ads.
We observe that although we still end up with some near duplicates (sim > 0.9), we have a high number of non-duplicates (0.1 < sim < 0.3), which ensures robust training data for our classifier.
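The similarity measure behind the histogram can be sketched in a few lines of Python. The whitespace tokenisation, the lowercasing, and the convention for two empty texts are assumptions of this sketch; the definition of Jaccard similarity itself (intersection size over union size) is standard.

```python
def unigrams(text):
    """Set of lowercased whitespace-separated tokens (assumed tokenisation)."""
    return set(text.lower().split())


def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the unigram sets of two ad texts."""
    a, b = unigrams(text_a), unigrams(text_b)
    if not a and not b:
        return 1.0  # assumption: two empty texts are treated as identical
    return len(a & b) / len(a | b)
```

For example, two ads that differ in a single word out of five share four of six distinct unigrams, giving a similarity of about 0.67, well below the near-duplicate range (sim > 0.9).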

Training
To train our classifier we experiment with various classifiers, such as Logistic Regression, Naive Bayes and Random Forest, using scikit-learn (Pedregosa et al., 2011). Table 2 shows the most informative features learnt by the Random Forest classifier. It is interesting to note that the most informative features include the spatial (Location), temporal (Time Difference, Posting Date) and linguistic (Number of Special Characters, Longest Common Substring) features. We also find that the domain-specific features, extracted using regexes, prove to be informative. The ROC curves for the classifiers we tested with different feature sets are presented in Figure 5. The classifiers perform well, with extremely low false positive rates. Such behavior is desirable for the classifier to act as a match function, in order to generate sensible results for the downstream tasks. High false positive rates increase the number of links between our records, leading to a 'snowball effect' which results in a breakdown of the downstream Entity Resolution process, as evidenced in Figure 6.
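Because the match function is only useful at very low false positive rates, a convenient summary is the true positive rate at a fixed false positive rate (for instance TPR@FPR=1%). The following Python sketch computes this quantity from labels and classifier scores; the function name and the tie-handling convention are assumptions of the sketch rather than a description of our evaluation code.

```python
def tpr_at_fpr(labels, scores, max_fpr=0.01):
    """Highest true positive rate reachable while keeping FPR <= max_fpr.

    `labels` are 0/1 integers and `scores` are classifier confidences,
    with higher scores meaning 'more likely a match'.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    # Lower the decision threshold step by step, from the highest score down.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        j = i
        # Examples tied at the same score cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows as the threshold is lowered further
        best_tpr = tp / n_pos
        i = j
    return best_tpr
```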
In order to minimize this breakdown, we need to heuristically learn an appropriate confidence value for our classifier. This is done by carrying out the ER process on 10,000 randomly selected records from our dataset. The size of the largest extracted connected component and the number of connected components isolated are calculated for different confidence values of our classifier. This allows us to come up with a sensible heuristic for the confidence value.
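The threshold sweep described above can be sketched as follows in Python. The input format (a list of `(ad_a, ad_b, score)` triples produced by the match function), the function names, and the choice to count singleton ads as components are all assumptions made for this illustration.

```python
from collections import Counter


def component_stats(nodes, scored_pairs, threshold):
    """Return (number of components, size of the largest one) at `threshold`.

    A pair is linked when its classifier score is at least `threshold`.
    `nodes` must be non-empty; singletons count as components (assumption).
    """
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    sizes = Counter(find(n) for n in nodes)
    return len(sizes), max(sizes.values())


def sweep_thresholds(nodes, scored_pairs, thresholds):
    """Evaluate the component statistics for each candidate threshold."""
    return {t: component_stats(nodes, scored_pairs, t) for t in thresholds}
```

A sudden jump in the size of the largest component as the threshold is lowered is the signature of the snowball effect, so a threshold just above that jump is a natural heuristic choice.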

Blocking Scheme
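Our dataset consists of over 5 million records. Naive pairwise comparison across the dataset makes this problem computationally intractable. In order to reduce the number of comparisons, we introduce a blocking scheme and perform exhaustive pairwise comparisons only within each block before resolving the dataset across blocks. We block the dataset on features like rare unigrams, rare bigrams and rare images.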
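Blocking on a rare feature can be illustrated with rare unigrams. In this Python sketch a unigram defines a block if it appears in at least two and at most `max_doc_freq` ads; that cut-off, the whitespace tokenisation, and the function names are hypothetical choices for illustration, and the later step of resolving across blocks is not shown.

```python
from collections import defaultdict
from itertools import combinations


def rare_unigram_blocks(ads, max_doc_freq=3):
    """Map each rare unigram to the set of ad ids whose text contains it.

    `ads` maps ad id -> ad text. `max_doc_freq` is an illustrative cut-off.
    """
    postings = defaultdict(set)
    for ad_id, text in ads.items():
        for token in set(text.lower().split()):
            postings[token].add(ad_id)
    # Tokens seen in a single ad cannot link anything, so they are dropped.
    return {
        token: ids
        for token, ids in postings.items()
        if 2 <= len(ids) <= max_doc_freq
    }


def candidate_pairs(blocks):
    """Enumerate the distinct ad pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs
```

Only the pairs returned by `candidate_pairs` would be passed to the match function, rather than all n(n-1)/2 pairs in the dataset.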

Rule Learning
We extract clusters and identify records that are associated with human trafficking using domain knowledge from experts. We featurize the extracted components using features like the size of the cluster, its spatio-temporal characteristics, and its connectivity. For our analysis, we consider only components with more than 300 advertisements. We then train a random forest to predict whether a cluster is linked to human trafficking. In order to establish statistical significance, we compare the ROC results of our classifier under 4-fold cross validation for 100 random connected components versus the positive set. Figure 9 and Table 4 list the performance of the classifier in terms of False Positive and True Positive Rate, while Table 5 lists the most informative features for this classifier.
We then proceed to learn rules from our feature set. Some of the rules, with their corresponding ratios and lift, are given in Table 3. PN curves corresponding to the various rules learnt are presented in Figure 10. It can be observed that the features used by the rule learner to learn rules with maximum support and ratios correspond to the ones labeled by the random forest as informative. This also serves as validation for the use of rule learning.
(c) While this pair is not extremely similar in terms of language, the existence of the rare alias 'SierraDayna' in both advertisements helps the classifier in matching them. This match can also easily be verified by the similar language structure of the pair.
(d) The first advertisement represents entities 'Black China' and 'Star Quality', while the second advertisement reveals that the pictures used in the first advertisement are not original and belong to the author of the second ad. This example pair shows the robustness of our match function. It also reveals how complicated relationships between various ads can be.
This entity has 802 nodes and 39,383 edges. This visualization is generated using Gephi (Bastian et al., 2009). This entity operated in cities across states and advertised multiple different individuals along with multiple phone numbers. This suggests a more complicated and organised activity and serves as an example of how complicated certain entities can be in this trade.
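For the rule statistics discussed in this section, the quantities involved can be made concrete with a short Python sketch. The definitions used here are the usual ones from association rule mining and are an assumption of the sketch, since they are not spelled out above: support is the fraction of components a rule covers, precision is the fraction of covered components that are positive, and lift is precision divided by the overall positive rate. The feature names in the usage example are hypothetical.

```python
def rule_metrics(rows, labels, rule):
    """Support, precision and lift of `rule` over labeled components.

    `rows` is a list of feature dicts (one per connected component),
    `labels` the matching 0/1 labels, and `rule` a predicate on a row.
    """
    covered = [y for row, y in zip(rows, labels) if rule(row)]
    if not covered:
        raise ValueError("rule covers no components")
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("lift is undefined without positive examples")

    support = len(covered) / len(rows)
    precision = sum(covered) / len(covered)
    return {
        "support": support,
        "precision": precision,
        "lift": precision / base_rate,
    }
```

For instance, a rule such as "component spans more than 10 cities" that covers half of the components, all of them positive, in a set where half of all components are positive has support 0.5 and lift 2.0.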

Conclusion
In this paper we approached the problem of isolating sources of human trafficking from online escort advertisements with a pairwise Entity Resolution approach. We trained a classifier able to predict whether two advertisements are from the same source, using phone numbers as a strong feature and exploiting it as proxy ground truth to generate training data for our classifier. The resultant classifier proved to be robust, as evidenced by extremely low false positive rates. Other approaches (Szekely et al., 2015) aim to build similar knowledge graphs using similarity scores between each feature. This has some limitations. Firstly, we need labelled training data in order to train match functions to detect ontological relations. The challenge is aggravated since this approach considers each feature independently, making generation of enough labelled training data for training multiple match functions an extremely complicated task.
Since we utilise existing features as proxy evidence, our approach can generate a large amount of training data without the need for any human annotation. Our approach requires learning just a single function over the entire feature set, hence our classifier can learn multiple complicated relations between features to predict a match, instead of relying on the naive feature independence assumption.
We then proceeded to use this classifier to perform entity resolution, using a heuristically learned value for the classifier score as the match threshold. The resultant connected components were again featurised, and a classifier model was fit before subjecting them to rule learning. On comparison with (Dubrawski et al., 2015), the connected component classifier performs a little better, with higher values of the area under the ROC curve and of TPR@FPR=1%, indicating a steeper ROC curve. We hypothesize that due to the entity resolution process, we are able to generate a larger, more robust set of training data which is immune to noise in labelling and results in a stronger classifier. The learnt rules show high ratios and lift for reasonably high supports, as evidenced in Table 3. Rule learning also adds an element of interpretability to the models we built; compared to more complex ensemble methods like Random Forests, having hard rules as classification models is preferred by domain experts to build evidence for incrimination.

Future Work
While our blocking scheme performs well in reducing the number of comparisons, our approach still involves naive pairwise comparisons, so scalability is a significant challenge. One approach could be to design such a pipeline in a distributed environment. Another approach could be to use a computationally inexpensive technique to de-duplicate the dataset of near-duplicate ads, which would greatly help with regard to scalability.
In our approach, the ER process depends upon the heuristically learnt match threshold. Lower threshold values can significantly degrade the performance, producing extremely large connected components. Treating the choice of this threshold as a learning task would help make this approach more generic and less domain specific.
Hashcodes of the images associated with the ads were also utilized as a feature for the match function. However, simple features like the number of unique and common images did not prove to be very informative. Further research is required in order to make better use of such visual data.
Figure 1: Escort advertisements are a classic source of what can be described as noisy text. Notice the excessive use of emojis, intentional misspelling and relatively benign colloquialisms to obfuscate a more nefarious intent. Domain experts extract meaningful cues from the spatial and temporal indicators and other linguistic markers to suspect trafficking activity, which further motivates the use of computational approaches to support such decision making.

Figure 2: The proposed Entity Resolution pipeline.

Figure 3: On applying our match function, weak links are generated for classifier scores above a certain match threshold. The strong links between nodes are represented by solid lines. Dashed lines represent the weak links generated by our classifier.

Figure 4: Text similarity for our sampling scheme. We use Jaccard similarity between the ad unigrams as a measure of text similarity. The histogram shows that the sampling scheme results in a large number of both near duplicates and non-duplicates. Such behavior is desired to ensure a robust match function.

Figure 5: ROC curves for our match function trained on various feature sets. The ROC curve shows reasonably large true positive rates for extremely low false positive rates, which is a desirable behaviour of the match function.

Figure 6: The plots show the number of connected components and the size of the largest component versus the match threshold.

Figure 9 :Figure 10 :
Figure 9: ROC for the Connected Component classifier. The black line is the positive set, while the red line is the average ROC for 100 randomly guessed predictors.
(a) This pair of ads has extremely similar textual content, including the use of non-Latin and special characters. The ads also advertise the same individual, as strongly evidenced by the common alias, 'Paris'.
(b) The first ad here does not include any specific names of individuals. However, the strong textual similarity with the second ad and the same advertised cost help to match them and discover the individuals being advertised as 'Nick' and 'Victoria'.

Figure 8 :Figure 11 :
Figure 8: Representative results of advertisement pairs matched by our classifier. In all four cases the advertisement pairs had no phone number information (strong feature) with which to detect connections. Note that sensitive elements have been intentionally obfuscated.

Table 1: Performance of TJBatchExtractor

Table 2: Most Informative Features

Table 3: Results of Rule Learning

Table 4: Metrics for the Connected Component classifier