A Computational Analysis of the Language of Drug Addiction

We present a computational analysis of the language of drug users when talking about their drug experiences. We introduce a new dataset of over 4,000 descriptions of experiences reported by users of four main drug types, and show that we can predict with an F1-score of up to 88% the drug behind a certain experience. We also perform an analysis of the dominant psycholinguistic processes and dominant emotions associated with each drug type, which sheds light on the characteristics of drug users.


Introduction
The World Drug Report estimated that in 2012 between 162 million and 324 million people, corresponding to between 3.5 per cent and 7.0 per cent of the world population aged 15-64, had used an illicit drug (United Nations Office, 2014). Moreover, in recent years, drug users have started to share their experiences on Web forums. The availability of this new and very large source of data presents new opportunities to analyse and understand the "drug use phenomenon." Recent studies have shown that, by processing these data with language processing techniques, it is possible to perform toxicovigilance tasks, e.g., finding new drug trends, adverse reactions, and geographic and demographic characterizations (Chary et al., 2013). Other studies have focused on the phenomenon of intoxication (Schuller et al., 2014). However, despite the interest in these topics, as far as we know, textual corpora of drug addicts' experiences are not yet available.
In this paper we introduce a corpus that can be exploited as a basis for a number of computational explorations on the language of drug users. One of the most controversial and interesting issues in addictionology studies is to understand why drug consumers prefer a particular type of drug over another. Differentiating drugs with respect to their subjective effects can have an important impact on clinical drug treatment, since it allows clinicians to better characterize patients in therapy with regard to the effects they seek through the drugs they use.
The paper is organized as follows. We first review the related work, followed by a description of the dataset of drug addict experiences that we constructed. Next, we present a classification experiment on predicting the drug behind an experience. We then present specific analyses of the language of drug users, i.e. their psycholinguistic processes and the emotions associated with an experience. Lastly, we conclude the paper and present some directions for future work.

Related Work
An important line of research on texts from social media is the PreDOSE platform (Cameron et al., 2013), designed to facilitate the epidemiological study of prescription (and related) drug abuse practices, together with its successors eDrugTrends and iN3. Another significant work was that of Paul and Dredze (2012). They developed factorial LDA, an extension of Blei's LDA, and for each drug they were able to collect multiple topics (route of administration, culture, chemistry, etc.) over posts collected from the website www.drugs-forum.com. Research on altered states of consciousness has mainly focused on alcohol intoxication and has mostly been performed on the Alcohol Language Corpus (Schiel et al., 2012), which is available only in German: for example, speech analysis (Wang et al., 2013; Bone et al., 2014) and a text-based system (Jauch et al., 2013) were used to analyse this data. Regarding alcohol intoxication detection, Joshi et al. (2015) developed a system for the automatic detection of drunk people from their posts on Twitter. Bedi et al. (2014) performed their analysis on transcriptions from a free speech task, in which the participants were volunteers who had previously been administered a dose of MDMA (3,4-methylenedioxy-methamphetamine). Even if this is an ideal case study for the cognitive analysis of the intoxication state, it is difficult to replicate on a large scale. Finally, as far as we know, the only attempt to classify and characterize experiences over different kinds of drugs was the project of Coyle et al. (2012). Using a random-forest classifier over 1,000 randomly collected reports from the website www.erowid.org, they identified subsets of words that differentiate the drugs.
Our research is also related to the broad theme of latent user attribute prediction, an emerging task within the natural language processing community that has recently been employed in fields such as public health (Coppersmith et al., 2015) and politics (Conover et al., 2011; Cohen and Ruths, 2013). Some of the attributes targeted for extraction are demographic, such as gender/age (Koppel et al., 2002; Mukherjee and Liu, 2010; Burger et al., 2011; Van Durme, 2012), race/ethnicity (Pennacchiotti and Popescu, 2011; Eisenstein et al., 2011; Rao et al., 2011), and location (Bamman et al., 2014). Other aspects are mined as well, among them emotion and sentiment, personality types (Schwartz et al., 2013), user political affiliation (Cohen and Ruths, 2013; Volkova and Durme, 2015), mental health diagnosis (Coppersmith et al., 2015), and even lifestyle choices such as coffee preference (Pennacchiotti and Popescu, 2011). The task is typically approached from a machine learning perspective, with data originating from a variety of user generated content, most often microblogs (Pennacchiotti and Popescu, 2011; Coppersmith et al., 2015), article comments to news stories or op-ed pieces (Riordan et al., 2014), social posts (originating from sites such as Facebook, MySpace, Google+) (Gong et al., 2012), or discussion forums on particular topics (Gottipati et al., 2014). Classification labels are then assigned based on manual annotations, self identified user attributes (Pennacchiotti and Popescu, 2011), affiliation with a given discussion forum type, or online surveys set up to link a social media user identification to the responses provided (Schwartz et al., 2013).
Learning has typically employed bag-of-words lexical features (n-grams) (Van Durme, 2012; Filippova, 2012; Nguyen et al., 2013), with some works focusing on deriving additional signals from the underlying social network structure (Pennacchiotti and Popescu, 2011; Yang et al., 2011; Gong et al., 2012; Volkova and Durme, 2015), syntactic and stylistic features (Bergsma et al., 2012), or the intrinsic social media generation dynamic (Volkova and Durme, 2015). We should note that some works have also explored unsupervised approaches for extracting demographic dimensions, among them large-scale clustering (Bergsma et al., 2013) and probabilistic graphical models (Eisenstein et al., 2010).
In the scientific literature about drug users, "purists" (i.e., consumers of only one specific substance) are rare. Nonetheless, when collecting the data, we decided to consider only reports describing one single drug in order to avoid the presence of a report in multiple categories, as well as to avoid descriptions of the interaction of multiple drugs, which are hard to characterize and still mostly unknown.

Predicting the Drug behind an Experience
To determine whether an automatic classifier is able to identify the drug behind a reported experience, we create a document classification task using Multinomial Naïve Bayes, with the default information gain feature weighting associated with this classifier. Each document corresponds to a report labelled with its drug category. Only minimal preprocessing was applied, i.e., part-of-speech tagging and lemmatization. No particular feature selection was performed: only stopwords were removed, keeping nouns, adjectives, verbs, and adverbs. Since the majority class in the experiment was the hallucinogens category, we set the baseline to its percentage: 61%. We evaluate the system with five-fold cross-validation and obtain an overall F1-score (micro-average) of 88%, indicating that good separation can be obtained by an automatic classifier (see Table 3). Not surprisingly, the hallucinogen experiences are the easiest to classify, probably due to the larger amount of data available for this drug. Table 4 shows a sample of the most informative features for the four categories. For example, we can observe that those using empathogens are more "night"-oriented, while those addicted to sedatives and stimulants are "day"-oriented. The use of hallucinogens, instead, seems to be associated with a perceptual visual experience (e.g., see#v).
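As a rough illustration of the classification setup described above, the following Python sketch implements a multinomial Naive Bayes classifier with add-one smoothing over lemma#pos tokens. It is not the implementation used in the experiments: the function names (train_nb, predict_nb), the smoothing constant, and the toy reports are assumptions made for illustration, and the information gain weighting and five-fold cross-validation are omitted.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {}
    for label, n_class_docs in class_doc_counts.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_class_docs / len(docs)),
            "log_lik": {
                w: math.log((word_counts[label][w] + alpha) / denom)
                for w in vocab
            },
        }
    return model


def predict_nb(model, tokens):
    """Return the label with the highest posterior log-probability."""
    def score(label):
        params = model[label]
        # Tokens never seen in training are skipped.
        return params["log_prior"] + sum(
            params["log_lik"][w] for w in tokens if w in params["log_lik"]
        )
    return max(model, key=score)


# Hypothetical toy reports (lemma#pos tokens), not taken from the dataset.
toy_docs = [
    ["see#v", "trip#n", "look#v"],
    ["see#v", "trip#n", "experience#n"],
    ["pill#n", "people#n", "night#n"],
    ["night#n", "people#n", "good#a"],
    ["day#n", "coke#n", "want#v"],
    ["day#n", "hour#n", "effect#n"],
]
toy_labels = ["HAL", "HAL", "EMP", "EMP", "STI", "SED"]

toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, ["see#v", "trip#n"]))
print(predict_nb(toy_model, ["night#n", "people#n"]))
```

On data of realistic size, the same two functions would be wrapped in a cross-validation loop, and the predictions of each fold would be pooled to compute the micro-averaged F1-score.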

Psycholinguistic Processes
To gain a better understanding of the characteristics of drug users, we analyse the distribution of psycholinguistic word classes according to the Linguistic Inquiry and Word Count (LIWC) lexicon, a resource developed by Pennebaker and colleagues (Pennebaker and Francis, 1999). The 2015 version of LIWC includes 19,000 words and word stems grouped into 73 broad categories relevant to psychological processes. The LIWC lexicon has been validated by showing significant correlation between human ratings of a large number of written texts and the rating obtained through LIWC-based analyses of the same texts.
For each drug type T, we calculate the dominance score associated with each LIWC class C (Mihalcea and Strapparava, 2009). This score is calculated as the ratio between the percentage of words that appear in T and belong to C, and the percentage of words that appear in any other drug type but T and belong to C. A score significantly higher than 1 indicates a LIWC class that is dominant for the drug type T, and thus likely to be a characteristic of the experiences reported by users of this drug. Table 5 shows the top five dominant psycholinguistic word classes associated with each drug type. Interestingly, descriptions of experiences reported by users of empathogens are centered around people (e.g., Affiliation, which includes words such as club, companion, and collaborate; We; Friend). Hallucinogens result in experiences that relate to the human senses (e.g., See, Hear, Perception). The experiences of users of sedatives and stimulants appear to be more concerned with mundane topics (e.g., Money, Work, Health).
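The dominance score defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (coverage, dominance), the toy token lists, and the word class labelled "social" are hypothetical and do not reproduce the actual LIWC word lists or the corpus.

```python
def coverage(tokens, class_words):
    """Fraction of tokens that belong to the given word class."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in class_words) / len(tokens)


def dominance(corpus_by_type, target, class_words):
    """Coverage of the class in `target` divided by its coverage
    in the pooled text of all other drug types."""
    target_cov = coverage(corpus_by_type[target], class_words)
    other_tokens = [
        tok
        for name, tokens in corpus_by_type.items()
        if name != target
        for tok in tokens
    ]
    other_cov = coverage(other_tokens, class_words)
    if other_cov == 0:
        # The ratio is undefined when the class never occurs elsewhere.
        return float("inf") if target_cov > 0 else float("nan")
    return target_cov / other_cov


# Hypothetical toy corpus and word class, for illustration only.
toy_corpus = {
    "EMP": ["we", "friend", "dance", "night"],
    "HAL": ["see", "color", "we", "wall"],
    "SED": ["work", "sleep", "day", "money"],
}
social_class = {"we", "friend"}

for drug_type in toy_corpus:
    print(drug_type, dominance(toy_corpus, drug_type, social_class))
```

In this toy example the class covers half of the EMP tokens but only one in eight of the remaining tokens, so its dominance score for EMP is 4, well above 1.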
Drug Type: EMP
Example: I found myself witnessing an argument between a man and a woman whom I've never met. I felt empathetic towards both of them, recognizing their struggle, he meant well, but couldn't find the right words, she, obviously cared a great deal for him but was doubtful of his intentions.

Table 4: Sample of the most informative features for the four categories.
EMP: experience#n good#a pill#n people#n about#r drug#n night#n start#v
HAL: see#v experience#n trip#n look#v back#r say#v try#v down#r as#r
SED: day#n drug#n start#v about#r try#v good#a hour#n still#r effect#n
STI: day#n drug#n coke#n good#a try#v start#v about#r want#v really#r

To quantify the similarity of the distributions of psycholinguistic processes across the four drug types, we also calculate the Pearson correlation between the dominance scores for all LIWC classes. As seen in Table 6, empathogens appear to be the most dissimilar with respect to the other drug types. Hallucinogens, instead, seem to be most similar to stimulants and sedatives.
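The pairwise comparison of drug types through the Pearson correlation of their dominance scores could be computed along the following lines. The sketch assumes that each drug type is represented by a vector of dominance scores listed in the same class order; the function name (pearson) and the score values are hypothetical and are not the values reported in Table 6.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical dominance scores, one value per word class,
# listed in the same class order for every drug type.
toy_scores = {
    "EMP": [1.8, 0.6, 0.7, 1.1],
    "HAL": [0.7, 1.6, 0.9, 1.0],
    "SED": [0.8, 0.9, 1.4, 1.0],
    "STI": [0.9, 0.8, 1.5, 0.9],
}

for a, b in combinations(toy_scores, 2):
    print(a, b, round(pearson(toy_scores[a], toy_scores[b]), 3))
```

A coefficient close to 1 indicates that two drug types emphasize the same word classes, while a low or negative coefficient indicates dissimilar profiles.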

Emotions and Drugs
Another interesting dimension to explore in relation to drug experiences is the presence of various emotions. To quantify this dimension, we use a methodology similar to the one described above, and calculate the dominance score for each of six emotion word classes: anger, disgust, fear, joy, sadness, and surprise (Ortony et al., 1987; Ekman, 1993). As a resource, we use WordNet Affect (Strapparava and Valitutti, 2004), in which words from WordNet are annotated with several emotions. As before, the dominance scores are calculated for the experiences reported for each drug type when compared to the other drug types. Table 7 shows the scores for the four drug types and the six emotions. A score significantly higher than 1 indicates a class that is dominant in that category. Interesting differences emerge from this table: the use of empathogens leads to experiences that are high on joy and surprise, whereas the dominant emotion in the use of hallucinogens as compared to the other drugs is fear. Sedatives lead to an increase in disgust, while stimulants have a mix of anger and joy.

Conclusions
Automating the language assessment of drug addict experiences has a potentially large impact on both toxicovigilance and prevention. Drug users are inclined to underreport symptoms to avoid negative consequences, and they often lack the self-awareness necessary to report a drug abuse problem. In fact, people with drug misuse problems are often reported by a third party (social services, police, families) only when the situation can no longer be ignored.
In this paper, we introduced a new dataset of drug use experiences, which can facilitate additional research in this space. We have described preliminary classification experiments, which showed that we can predict the drug behind an experience with a performance of up to 88% F1-score. To better understand the characteristics of drug users, we have also presented an analysis of the psycholinguistic processes and emotions associated with different drug types.
We would like to continue the present work along the following directions: (i) Extend the corpus with texts written by people who presumably do not ordinarily use drugs, drawing on patient-submitted forum posts about ordinary medicines. The style of such posts is expected to be similar to that of drug experience reports, since both describe an experience with a particular substance; (ii) Explore the association between drug preferences and personality types. Following Khantzian's hypothesis (Khantzian, 1997), certain personalities may be more prone to a particular drug because of its subjective effects. Characterizing subjects by their potential drug preferences could enable clinicians, as in a reversed "recommender system," to explicitly warn their patients to avoid particular kinds of substances since they could become addictive.
The dataset introduced in this paper is available for research purposes upon request to the authors.