The Language of Legal and Illegal Activity on the Darknet

The non-indexed parts of the Internet (the Darknet) have become a haven for both legal and illegal anonymous activity. Given the magnitude of these networks, scalably monitoring their activity necessarily relies on automated tools, and notably on NLP tools. However, little is known about the characteristics of texts communicated through the Darknet, or about how well off-the-shelf NLP tools perform on this domain. This paper tackles this gap and performs an in-depth investigation of the characteristics of legal and illegal text in the Darknet, comparing it to a clear net website with similar content as a control condition. Taking drugs-related websites as a test case, we find that texts for selling legal and illegal drugs have several linguistic characteristics that distinguish them from one another, as well as from the control condition, among them the distribution of POS tags, and the coverage of their named entities in Wikipedia.


Introduction
The Darknet refers to the subset of Internet sites and pages that are not indexed by search engines. Due to their use of the ".onion" top-level domain, these websites are often referred to as "Onion sites", and are reachable anonymously via the Tor network.
Under the cloak of anonymity, the Darknet harbors much illegal activity (Moore and Rid, 2016). Applying NLP tools to text from the Darknet is thus important for effective law enforcement and intelligence. However, little is known about the characteristics of the language used in the Darknet, and specifically about what distinguishes text on websites that conduct legal and illegal activity. In fact, the only work we are aware of that classified Darknet texts into legal and illegal activity is Avarikioti et al. (2018), but they too did not investigate in what ways these two classes differ.
This paper addresses this gap, and studies the distinguishing features between legal and illegal texts in Onion sites, taking sites that advertise drugs as a test case. We compare our results to a control condition of texts from eBay pages that advertise products corresponding to drug keywords.
We find a number of distinguishing features. First, we confirm the results of Avarikioti et al. (2018), that text from legal and illegal pages (henceforth, legal and illegal texts) can be distinguished based on the identity of the content words (bag-of-words) with about 90% accuracy over a balanced sample. Second, we find that the distribution of POS tags in the documents is a strong cue for distinguishing between the classes (about 71% accuracy). This indicates that the two classes are different in terms of their syntactic structure. Third, we find that legal and illegal texts are roughly as distinguishable from one another as legal texts and eBay pages are (both in terms of their words and their POS tags). The latter point suggests that legal and illegal texts can be considered distinct domains, which explains why they can be automatically classified, but also implies that applying NLP tools to this domain is likely to face the obstacles of domain adaptation. Indeed, we show that named entities in illegal pages are covered less well by Wikipedia, i.e., Wikification works less well on them. This suggests that for high-performance text understanding, specialized knowledge bases and tools may be needed for processing texts from the Darknet. By experimenting on a different domain in Tor (user-generated content), we show that the legal/illegal distinction generalizes across domains.
After discussing previous works in Section 2, we detail the datasets used in Section 3. Differences in the vocabulary and named entities between the classes are analyzed in Section 4, before the presentation of the classification experiments (Section 5). Section 6 presents additional experiments, which explore cross-domain classification. We further analyze and discuss the findings in Section 7.

Related Work
The detection of illegal activities on the Web is sometimes derived from a more general topic classification. For example, Biryukov et al. (2014) classified the content of Tor hidden services into 18 topical categories, only some of which correlate with illegal activity. Graczyk and Kinningham (2015) combined unsupervised feature selection and an SVM classifier for the classification of drug sales in an anonymous marketplace. While these works classified Tor texts into classes, they did not directly address the legal/illegal distinction.
Some works directly addressed a specific type of illegality and a particular communication context. Morris and Hirst (2012) used an SVM classifier to identify sexual predators in chatting message systems. The model includes both lexical features, including emoticons, and behavioral features that correspond to conversational patterns. Another example is the detection of pedophile activity in peer-to-peer networks (Latapy et al., 2013), where a predefined list of keywords was used to detect child-pornography queries. Besides lexical features, we here consider other general linguistic properties, such as syntactic structure.
Al Nabki et al. (2017) presented DUTA (Darknet Usage Text Addresses), the first publicly available Darknet dataset, together with a manual classification into topical categories and subcategories. For some of the categories, legal and illegal activities are distinguished. However, the automatic classification presented in their work focuses on the distinction between different classes of illegal activity, without addressing the distinction between legal and illegal ones, which is the subject of the present paper. Al Nabki et al. (2019) extended the dataset to form DUTA-10K, which we use here. Their results show that 20% of the hidden services correspond to "suspicious" activities. The analysis was conducted using the text classifier presented in Al Nabki et al. (2017) and manual verification.
[...] sification) are included in the supplementary material, and can be found at https://github.com/huji-nlp/cyber.
Recently, Avarikioti et al. (2018) presented another topic classification of text from Tor together with a first classification into legal and illegal activities. The experiments were performed on a newly crawled corpus obtained by recursive search. The legal/illegal classification was done using an SVM classifier in an active learning setting with bag-of-words features. Legality was assessed in a conservative way where illegality is assigned if the purpose of the content is an obviously illegal action, even if the content might be technically legal. They found that a linear kernel worked best and reported an F1 score of 85% and an accuracy of 89%. Using the dataset of Al Nabki et al. (2019), and focusing on specific topical categories, we here confirm the importance of content words in the classification, and explore the linguistic dimensions supporting classification into legal and illegal texts.

Datasets Used
Onion corpus. We experiment with data from Darknet websites containing legal and illegal activity, all from the DUTA-10K corpus (Al Nabki et al., 2019). We selected the "drugs" sub-domain as a test case, as it is a large domain in the corpus that has "legal" and "illegal" sub-categories, and where the distinction between them can be reliably made. These websites advertise and sell drugs, often to international customers. While legal websites often sell pharmaceuticals, illegal ones are often related to substance abuse. These pages are directed by sellers to their customers.
eBay corpus. As an additional dataset of similar size and characteristics, but from a clear net source, and of legal nature, we compiled a corpus of eBay pages. eBay is one of the largest hosting sites for retail sellers of various goods. Our corpus contains 118 item descriptions, each consisting of more than one sentence. Item descriptions vary in price, item sold and seller. The descriptions were selected by searching eBay for drug-related terms, and selecting search patterns to avoid over-repetition. For example, where many sellers offer the same product, only one example was added to the corpus. Search queries also included filtering for price, so that each query resulted in different items. Either because of advertisement strategies or the geographical dispersion of the sellers, the eBay corpus contains formal as well as informal language, and some item descriptions contain abbreviations and slang. Importantly, eBay websites are assumed to conduct legal activity: even when they discuss drug-related material, we find it is never the sale of illegal drugs but rather merchandise, tools, or otherwise related content.
Cleaning. As preprocessing for all experiments, we apply some cleaning to the text of web pages in our corpora. HTML markup is already removed in the original datasets, but much non-linguistic content remains, such as buttons, encryption keys, metadata and URLs. We remove such text from the web pages, and join paragraphs to single lines (as newlines are sometimes present in the original dataset for display purposes only). We then remove any duplicate paragraphs, where paragraphs are considered identical if they share all but numbers (to avoid an over-representation of some remaining surrounding text from the websites, e.g. "Showing all 9 results").
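The duplicate-removal step can be illustrated with a minimal sketch. It assumes that "sharing all but numbers" can be approximated by masking every run of digits before comparison; the function names are hypothetical, and the removal of buttons, encryption keys, metadata and URLs is not shown.

```python
import re

_NUMBER = re.compile(r"\d+")


def dedup_key(paragraph: str) -> str:
    """Normalize a paragraph so that texts differing only in numbers collide."""
    # Join display-only newlines and repeated whitespace into single spaces.
    joined = " ".join(paragraph.split())
    # Mask every run of digits (an assumption about what "numbers" means here).
    return _NUMBER.sub("0", joined)


def remove_duplicate_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph, ignoring numeric differences."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        key = dedup_key(paragraph)
        if key not in seen:
            seen.add(key)
            kept.append(" ".join(paragraph.split()))
    return kept
```

With this masking, "Showing all 9 results" and "Showing all 12 results" map to the same key, so only the first of the two is kept.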

Domain Differences
As pointed out by Plank (2011), there is no common ground on what constitutes a domain. Domain differences are attributed in some works to differences in vocabulary (Blitzer et al., 2006) and in other works to differences in style, genre and medium (McClosky, 2010). While here we adopt an existing classification, based on the DUTA-10K corpus, we show in which way and to what extent it translates to distinct properties of the texts. This question bears on the possibility of distinguishing between legal and illegal drug-related websites based on their text alone (i.e., without recourse to additional information, such as meta-data or network structure). We examine two types of domain differences between legal and illegal texts: vocabulary differences and named entities.

Vocabulary Differences
To quantify the domain differences between legal and illegal texts, we compute the frequency distribution of words in the eBay corpus, the legal and illegal drugs Onion corpora, and the entire Onion drug section (All Onion). Since any two sets of texts are bound to show some disparity between them, we compare the differences between domains to a control setting, where we randomly split each examined corpus into two halves, and compute the frequency distribution of each of them. The inner consistency of each corpus, defined as the similarity of distributions between the two halves, serves as a reference point for the similarity between domains. We refer to this measure as "self-distance".
Following Plank and van Noord (2011), we compute the Jensen-Shannon divergence and Variational distance (also known as L1 or Manhattan) as the comparison measures between the word frequency histograms. Table 1 presents our results. The self-distance of the eBay, Legal Onion and Illegal Onion corpora lies between 0.40 and 0.45 by the Jensen-Shannon divergence, but the distance between each pair is 0.60 to 0.65, with the three approximately forming an equilateral triangle in the space of word distributions. Similar results are obtained using Variational distance, and are omitted for brevity.
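The comparison can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure behind Table 1: it assumes base-2 logarithms (so the divergence lies in [0, 1]), a random split at the document level, and pre-tokenized input; all function names are hypothetical.

```python
import math
import random
from collections import Counter


def word_distribution(tokens):
    """Relative frequency of each word type in a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two word distributions (base-2 logs)."""
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        if pw > 0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


def self_distance(documents, seed=0):
    """Split a corpus (list of token lists) into two random halves and compare them."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    half = len(docs) // 2
    first = [token for doc in docs[:half] for token in doc]
    second = [token for doc in docs[half:] for token in doc]
    return jensen_shannon(word_distribution(first), word_distribution(second))
```

Under these assumptions, two corpora count as distinct domains when `jensen_shannon` between them is clearly larger than the `self_distance` of each.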
These results suggest that rather than regarding all drug-related Onion texts as one domain, with legal and illegal texts as sub-domains, they should be treated as distinct domains. Therefore, using Onion data to characterize the differences between illegal and legal linguistic attributes is sensible. In fact, it is more sensible than comparing Illegal Onion to eBay text, as there the legal/illegal distinction may be confounded by the differences between eBay and Onion data.

Differences in Named Entities
In order to analyze the difference in the distribution of named entities between the domains, we used a Wikification technique (Bunescu and Paşca, 2006), i.e., linking entities to their corresponding article in Wikipedia.
Using spaCy's named entity recognition, we first extract all named entity mentions from all the corpora. We then search for relevant Wikipedia entries for each named entity using the DBpedia Ontology API (Daiber et al., 2013). For each domain we compute the total number of named entities and the percentage with corresponding Wikipedia articles.
The results were obtained by averaging the percentage of wikifiable named entities in each site per domain. We also report the standard error for each average. According to our results (Table 2), the Wikification success ratios of eBay and Illegal Onion named entities are comparable and relatively low. However, sites selling legal drugs on Onion have a much higher Wikification percentage.
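The per-domain aggregation can be sketched as below. The sketch assumes that each site is summarized by a pair (number of named entities, number linked to Wikipedia), that sites without any named entity are skipped, and that the standard error uses the sample standard deviation; none of these details is stated in the text, and the function name is hypothetical.

```python
import math


def wikification_summary(sites):
    """Average per-site Wikification percentage and its standard error.

    `sites` is a list of (n_entities, n_linked) pairs, one per site.
    """
    # Per-site percentage; sites without named entities are skipped (assumption).
    percentages = [100.0 * linked / total for total, linked in sites if total > 0]
    n = len(percentages)
    mean = sum(percentages) / n
    if n < 2:
        return mean, 0.0
    # Sample variance (n - 1 in the denominator), then standard error of the mean.
    variance = sum((p - mean) ** 2 for p in percentages) / (n - 1)
    return mean, math.sqrt(variance / n)
```

Averaging per site, rather than pooling all entities of a domain, keeps a single large site from dominating the estimate.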
Presumably the named entities in Onion sites selling legal drugs are more easily found in public databases such as Wikipedia because they are mainly well-known names for legal pharmaceuticals. However, in both Illegal Onion and eBay sites, the list of named entities includes many slang terms for illicit drugs and paraphernalia. These slang terms are usually not well known by the general public, and are therefore less likely to be covered by Wikipedia and similar public databases.
In addition to the differences in Wikification ratios between the domains, it seems spaCy had trouble correctly identifying named entities in both Onion and eBay sites, possibly due to the common use of informal language and drug-related jargon. Eyeballing the results, there were a fair number of false positives (words and phrases that were found by spaCy but were not actually named entities), especially in Illegal Onion sites. In particular, slang terms for drugs, as well as abbreviated drug terms, for example "kush" or "GBL", were being falsely picked up by spaCy.
To summarize, the results suggest that (1) legal and illegal texts differ in terms of their named entities and their coverage in Wikipedia, and that (2) standard NLP tools for named entity recognition (and potentially other text understanding tasks), and standard databases, require considerable adaptation to be fully functional on text related to illegal activity.

Classification Experiments
Here we detail our experiments in classifying text from different legal and illegal domains using various methods, to find the most important linguistic features distinguishing between the domains. Another goal of the classification task is to confirm our finding that the domains are distinguishable.
Experimental setup. We split each subset among {eBay, Legal Onion, Illegal Onion} into training, validation and test. We select 456 training paragraphs, 57 validation paragraphs and 58 test paragraphs for each category (approximately an 80%/10%/10% split), randomly downsampling larger categories for an even division of labels.
Model. To classify paragraphs into categories, we experiment with five classifiers:
• NB (Naive Bayes) classifier with binary bag-of-words features, i.e., an indicator feature for each word. This simple classifier features frequently in work on text classification in the Darknet.
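A minimal sketch of such a balanced split is given below. The split sizes come from the text; the function name, the seeding and the dictionary layout are assumptions made for illustration.

```python
import random


def balanced_split(paragraphs_by_class, n_train=456, n_val=57, n_test=58, seed=0):
    """Downsample every class to the same size and split it into three parts."""
    rng = random.Random(seed)
    needed = n_train + n_val + n_test
    splits = {"train": [], "validation": [], "test": []}
    for label, paragraphs in paragraphs_by_class.items():
        if len(paragraphs) < needed:
            raise ValueError(f"class {label!r} has fewer than {needed} paragraphs")
        # Random downsampling: draw exactly `needed` paragraphs without replacement.
        sample = rng.sample(list(paragraphs), needed)
        splits["train"] += [(p, label) for p in sample[:n_train]]
        splits["validation"] += [(p, label) for p in sample[n_train:n_train + n_val]]
        splits["test"] += [(p, label) for p in sample[n_train + n_val:]]
    return splits
```

Because every class contributes the same number of paragraphs to each part, chance accuracy on a two-class test set is 50%.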
• SVM (support vector machine) classifier with an RBF kernel, also with BoW features that count the number of words of each type.
• BoE (bag-of-embeddings): each word is represented with its 100-dimensional GloVe vector (Pennington et al., 2014). BoE_sum (BoE_average) sums (averages) the embeddings for all words in the paragraph to a single vector, and applies a 100-dimensional fully-connected layer with ReLU nonlinearity and dropout p = 0.2. The word vectors are not updated during training.
• seq2vec: same as BoE, but instead of averaging word vectors, we apply a BiLSTM to the word vectors, and take the concatenated final hidden vectors from the forward and backward part as the input to a fully-connected layer (same hyper-parameters as above).
• attention: we replace the word representations with contextualized pre-trained representations from ELMo. We then apply a self-attentive classification network (McCann et al., 2017) over the contextualized representations. This architecture has proved very effective for classification in recent work (Tutek and Šnajder, 2018; Shen et al., 2018).
For the NB classifier we use BernoulliNB from scikit-learn with α = 1, and for the SVM classifier we use SVC, also from scikit-learn, with γ = scale and tolerance = 10^-5. We use the AllenNLP library to implement the neural network classifiers.
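The two scikit-learn classifiers can be configured roughly as follows. The hyper-parameters are those stated above; the use of `CountVectorizer` with its default tokenization for the bag-of-words features, the pipeline wrappers and the builder names are assumptions, since the text does not specify the feature extraction code.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_nb():
    """Naive Bayes over binary word-indicator features, smoothing alpha = 1."""
    return make_pipeline(CountVectorizer(binary=True), BernoulliNB(alpha=1.0))


def build_svm():
    """RBF-kernel SVM over word-count features, gamma='scale', tolerance 1e-5."""
    return make_pipeline(
        CountVectorizer(),
        SVC(kernel="rbf", gamma="scale", tol=1e-5),
    )


if __name__ == "__main__":
    # Tiny made-up paragraphs, for illustration only.
    texts = [
        "buy aspirin tablets online",
        "prescription pharmacy tablets",
        "pure kush fast stealth shipping",
        "stealth shipping best kush",
    ]
    labels = ["legal", "legal", "illegal", "illegal"]
    for build in (build_nb, build_svm):
        model = build().fit(texts, labels)
        print(build.__name__, model.predict(["kush stealth shipping"]))
```

Both builders return an unfitted pipeline, so the same training and evaluation code can be reused for every data manipulation and setting.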
Data manipulation. In order to isolate what factors contribute to the classifiers' performance, we experiment with four manipulations to the input data (in training, validation and testing). Specifically, we examine the impact of variations in the content words, function words and shallow syntactic structure (represented through POS tags). For this purpose, we consider content words as words whose universal part-of-speech according to spaCy is one of the following: [...]
Results when applying these manipulations are compared to the full condition, where all words are available.
Settings. We experiment with two settings, classifying paragraphs from different domains:
• Training and testing on eBay pages vs. Legal drug-related Onion pages, as a control experiment to identify whether Onion pages differ from clear net pages.
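The manipulations can be sketched over paragraphs that have already been POS-tagged. Two points in this sketch are assumptions rather than facts from the text: the set of content tags (`ASSUMED_CONTENT_TAGS` is a hypothetical choice, since the list of tags is not available here), and the fourth manipulation, taken to be the replacement of function words by their POS tags, by analogy with the "drop cont." and "pos cont." conditions mentioned in the results.

```python
# Hypothetical set of universal POS tags treated as content words (assumption).
ASSUMED_CONTENT_TAGS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV", "NUM"}

MODES = ("full", "drop_cont", "drop_func", "pos_cont", "pos_func")


def manipulate(tagged_tokens, mode):
    """Apply one input manipulation to a list of (word, universal_pos) pairs."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    output = []
    for word, tag in tagged_tokens:
        is_content = tag in ASSUMED_CONTENT_TAGS
        if mode == "full":
            output.append(word)  # all words available
        elif mode == "drop_cont":
            if not is_content:
                output.append(word)  # keep function words only
        elif mode == "drop_func":
            if is_content:
                output.append(word)  # keep content words only
        elif mode == "pos_cont":
            output.append(tag if is_content else word)  # content word -> POS tag
        else:  # "pos_func"
            output.append(word if is_content else tag)  # function word -> POS tag
    return output
```

For example, under these assumptions the tagged paragraph `[("We", "PRON"), ("ship", "VERB"), ("the", "DET"), ("pills", "NOUN")]` becomes `["We", "VERB", "the", "NOUN"]` in the `pos_cont` mode.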
• Training and testing on Legal Onion vs. Illegal Onion drugs-related pages, to identify the difference in language between legal and illegal activity on Onion drug-related websites.

Results
The accuracy scores for the different classifiers and settings are reported in Table 3. [...] the SVM classifier, suggesting that the domains are distinguishable (albeit to a lesser extent) based on the function word distribution alone. Surprisingly, the more sophisticated neural classifiers perform worse than Naive Bayes. This is despite using pretrained word embeddings, and architectures that have proven beneficial for text classification. It is likely that this is due to the small size of the training data, as well as the specialized vocabulary found in this domain, which is unlikely to be supported well by the pretrained embeddings (see §4.2).
Legal vs. illegal drugs. Classifying legal and illegal pages within the drugs domain on Onion proved to be a more difficult task. However, where content words are replaced with their POS tags, the SVM classifier distinguishes between legal and illegal texts with quite a high accuracy (70.7% on a balanced test set). This suggests that the syntactic structure is sufficiently different between the domains, so as to make them distinguishable in terms of their distribution of grammatical categories.

Illegality Detection Across Domains
To investigate illegality detection across different domains, we perform classification experiments on the "forums" category that is also separated into legal and illegal sub-categories in DUTA-10K. The forums contain user-written text on various topics. Legal forums often discuss web design and other technical and non-technical activity on the internet, while illegal ones involve discussions about cyber-crimes and guides on how to commit them, as well as narcotics, racism and other criminal activities. As this domain contains user-generated content, it is more varied and noisy.

Experimental setup
We use the cleaning process described in Section 3 and data splitting described in Section 5, with the same number of paragraphs. We experiment with two settings:
• Training and testing on Onion legal vs. illegal forums, to evaluate whether the insights observed in the drugs domain generalize to user-generated content.
• Training on Onion legal vs. illegal drugs-related pages, and testing on Onion legal vs. illegal forums. This cross-domain evaluation reveals whether the distinctions learned on the drugs domain generalize directly to the forums domain.

Results
Accuracy scores are reported in Table 4.
Legal vs. illegal forums. Results when training and testing on forums data are much worse for the neural-based systems, probably due to the much noisier and more varied nature of the data. However, the SVM model achieves an accuracy of 85.3% in the full setting. This model performs well even when the content words are dropped (drop cont.) or replaced by part-of-speech tags (pos cont.), underscoring that legal and illegal content can be distinguished based on shallow syntactic structure in this domain as well.
Cross-domain evaluation. Surprisingly, training on drugs data and evaluating on forums performs much better than in the in-domain setting for four out of five classifiers. This implies that while the forums data is noisy, it can be accurately classified into legal and illegal content when training on the cleaner drugs data. This also shows that illegal texts in Tor share common properties regardless of topical category. The much lower results obtained by the models where content words are dropped (drop cont.) or converted to POS tags (pos cont.), namely less than 70% as opposed to 89.7% when function words are dropped, suggest that some of these properties are lexical.

Discussion
As shown in Section 4, the Legal Onion and Illegal Onion domains are quite distant in terms of word distribution and named entity Wikification. Moreover, named entity recognition and Wikification work less well for the illegal domain, and so do state-of-the-art neural text classification architectures (Section 5), which present inferior results to a simple bag-of-words model. This is likely a result of the different vocabulary and syntax of text from the Onion domain, compared to standard domains used for training NLP models and pretrained word embeddings. This conclusion has practical implications: to effectively process text in Onion, considerable domain adaptation should be performed, and effort should be made to annotate data and extend standard knowledge bases to cover this idiosyncratic domain. Another conclusion from the classification experiments is that Legal Onion and Illegal Onion texts are harder to distinguish than eBay and Legal Onion texts, meaning that deciding on domain boundaries should consider syntactic structure, and not only lexical differences.
Analysis of texts from the datasets. Looking at specific sentences (Figure 1) reveals that Legal Onion and Illegal Onion are easy to distinguish based on the identity of certain words, e.g., terms for legal and illegal drugs, respectively. Thus looking at the word forms is already a good solution for tackling this classification problem, which gives further insight as to why modern text classification methods (e.g., neural networks) do not present an advantage in terms of accuracy.
Analysis of manipulated texts. Given that replacing content words with their POS tags substantially lowers performance for classification of legal vs. illegal drug-related texts (see "pos cont." in Section 5), we conclude that the distribution of parts of speech alone is not as strong a signal as the word forms for distinguishing between the domains. However, the SVM model does manage to distinguish between the texts even in this setting. Indeed, Figure 2 demonstrates that there are easily identifiable patterns distinguishing between the domains, but that a bag-of-words approach may not be sufficiently expressive to identify them.
Analysis of learned feature weights. As the Naive Bayes classifier was the most successful at distinguishing legal from illegal texts in the full setting (without input manipulation), we may conclude that the very occurrence of certain words provides a strong indication that an instance is taken from one class or the other. Table 5 shows the most indicative features learned by the Naive Bayes classifier for the Legal Onion vs. Illegal Onion classification in this setting. Interestingly, many strong features are function words, providing another indication of the different distribution of function words in the two domains.
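One way to inspect which words a Bernoulli Naive Bayes model finds indicative is to rank words by the log-ratio of their smoothed per-class presence probabilities. The sketch below illustrates that idea in plain Python; the ranking criterion and the function name are assumptions, as the text does not state how the features in Table 5 were selected.

```python
import math


def indicative_words(docs_by_class, class_a, class_b, alpha=1.0, top_k=5):
    """Rank words by log P(word present | class_a) / P(word present | class_b).

    `docs_by_class` maps a class label to a list of tokenized documents.
    Returns the top_k words most indicative of class_a and of class_b.
    """
    docs_a = [set(doc) for doc in docs_by_class[class_a]]
    docs_b = [set(doc) for doc in docs_by_class[class_b]]

    def presence_prob(word, docs):
        # Laplace-smoothed Bernoulli estimate, as in a Bernoulli Naive Bayes model.
        present = sum(1 for doc in docs if word in doc)
        return (present + alpha) / (len(docs) + 2 * alpha)

    vocabulary = set().union(*docs_a, *docs_b)
    scores = {
        word: math.log(presence_prob(word, docs_a) / presence_prob(word, docs_b))
        for word in vocabulary
    }
    # Ties are broken alphabetically so that the ranking is deterministic.
    top_a = sorted(vocabulary, key=lambda w: (-scores[w], w))[:top_k]
    top_b = sorted(vocabulary, key=lambda w: (scores[w], w))[:top_k]
    return top_a, top_b
```

A function word that appears in most documents of one class but few of the other receives a large absolute score under this criterion, just as a drug name would.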