BioReddit: Word Embeddings for User-Generated Biomedical NLP

Word embeddings, in their different shapes and iterations, have changed the natural language processing research landscape in recent years. The biomedical text processing field is no stranger to this revolution; however, scholars in the field have largely trained their embeddings on scientific documents only, even when working on user-generated data. In this paper we show how training embeddings on a corpus of user-generated text collected from medical forums heavily influences performance on downstream tasks: when applied to user-generated content, such embeddings outperform embeddings trained on either general-purpose data or scientific papers.


Introduction
In the Natural Language Processing community, user-generated content, i.e. data from social media, user forums, review websites, and so on, has been the subject of many studies in recent years; the same holds for the biomedical domain, where there has been great effort in applying NLP techniques to biomedical scientific publications, patient records, and so on. However, the intersection of the two fields is still in its infancy, even for relatively basic NLP tasks. For instance, in the field of user-generated biomedical natural language processing (henceforth UG-BioNLP), to the best of our knowledge there are no publicly available corpora for Named Entity Recognition (NER) akin in size and purpose to, e.g., the CoNLL 2003 dataset (Tjong Kim Sang and De Meulder, 2003), making it hard to compare systems effectively. Moreover, while there have been experiments on training word embeddings with biomedical data, we are not aware of any publicly available word embeddings trained on UG-BioNLP data.
For this reason, we decided to investigate the impact of using purpose-trained word embeddings in the UG-BioNLP field. In order to train such embeddings, we collected a dataset from Reddit, scraping posts from medical-themed subreddits, both on general health topics, such as 'r/AskDocs', and on disease-specific topics, such as 'r/cancer', 'r/asthma', and so on. We then trained word embeddings on this corpus using different off-the-shelf techniques. To evaluate the embeddings, we collected a second dataset of 4800 threads from the health forum HealthUnlocked, which we annotated for the NER task. We then analyzed the performance of the embeddings on the tasks of NER and adverse effect mention detection. For NER, we used Conditional Random Fields as a baseline. We compared them against Bidirectional LSTM-CRFs (Lample et al., 2016), on which we analyzed the impact of using our custom-trained word embeddings against embeddings trained on general-purpose data and scientific biomedical publications, evaluating on our purpose-built HealthUnlocked dataset and on the PsyTAR and CADEC corpora. Finally, we evaluated the performance of a simple architecture for adverse reaction mention detection on the PsyTAR corpus. We conclude the paper by explaining our intentions for future research, in order to obtain further results confirming the preliminary findings we present in this work.

Related Work
The benefit of using in-domain embeddings for the biomedical domain has already been demonstrated. For example, Pakhomov et al. (2016) and Wang et al. (2018) found that using clinical notes or biomedical articles for training word embeddings generally has a positive impact on downstream NLP tasks. Nikfarjam et al. (2015) trained embeddings on user-generated medical content and used them successfully on the pharmacovigilance task; however, they trained the embeddings on an adverse reaction mining corpus, making them too task-specific to be useful on generic UG-BioNLP tasks.

BioReddit
To train our embeddings on user-generated biomedical text, we chose to scrape data from the discussion website Reddit. The website is organized into forums, called subreddits, where discussion is restricted to a topic, e.g. general news, computer science, and so on. There is a great number of health-themed subreddits, where users from all around the world discuss their health problems or ask for medical advice, which makes Reddit ideal for training our embeddings.
We also evaluated the micro-blogging platform Twitter as a possible source for the embeddings, but we quickly discarded it due to its unstructured nature. On Twitter, in fact, information is not pre-aggregated by subject, and one has to search for the required posts by keyword or hashtag. This, along with the restrictive limits imposed by the Twitter APIs, makes it hard to find relevant content, so we decided to use Reddit instead.
We designed a scraping script that downloaded discussions from 68 health-themed subreddits. We selected subreddits where users
• can ask for advice, e.g. r/AskDocs, r/DiagnoseMe, r/AskaPharmacist;
• can discuss a specific illness, e.g. r/cancer, r/migraine, r/insomnia;
• can discuss any health-related topic, e.g. r/health, r/HealthIT, r/HealthInsurance.
We collected all the posts from these subreddits from the beginning of 2015 to the end of 2018. After that, we cleaned the corpus of bot-generated content, e.g. bots automatically suggesting to seek professional medical advice. We obtained a corpus with 300 million tokens and a vocabulary size of 780,000 words. While the number of tokens is considerably lower than the size of other word embedding training datasets, which can be two orders of magnitude bigger, the vocabulary is quite large; for example, GloVe (Pennington et al., 2014) was trained with a 1.2 million-word vocabulary and 27 billion tokens when using Twitter, and with a 600,000-word vocabulary and 6 billion tokens when using Wikipedia.
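The collection step above can be sketched as follows. The paper does not describe the scraping tooling, so the Pushshift endpoint, its parameters, and the pagination logic are illustrative assumptions rather than the script we actually used.

```python
# Sketch of a per-subreddit Reddit scraper; the Pushshift endpoint and its
# parameters are assumptions for illustration, not the paper's actual script.
import json
import urllib.parse
import urllib.request

PUSHSHIFT = "https://api.pushshift.io/reddit/search/submission"

def build_query_url(subreddit: str, after: int, before: int, size: int = 100) -> str:
    """Build a Pushshift query URL for one subreddit and one time window."""
    params = {"subreddit": subreddit, "after": after, "before": before,
              "size": size, "sort": "asc", "sort_type": "created_utc"}
    return PUSHSHIFT + "?" + urllib.parse.urlencode(params)

def fetch_posts(subreddit: str, after: int, before: int):
    """Page through all submissions in [after, before) for one subreddit."""
    while True:
        with urllib.request.urlopen(build_query_url(subreddit, after, before)) as r:
            batch = json.load(r).get("data", [])
        if not batch:
            return
        yield from batch
        after = batch[-1]["created_utc"]  # resume after the last post seen
```

Running `fetch_posts` once per subreddit over the 2015–2018 window, followed by the bot-filtering pass described above, would yield a corpus of the kind we trained on.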

HealthUnlocked
In order to evaluate our embeddings, as a first step we decided to focus on the Named Entity Recognition task. We obtained 4800 forum threads from HealthUnlocked, a British social network for health where users can discuss their health with people with similar conditions and obtain advice from professionals.
We annotated the dataset by marking the entities belonging to seven categories, namely: Phenotype, Disease, Anatomy, Molecule, Gene, Device, and Procedure. We describe in detail the categories in Table 1.
Since the dataset is collected from patients' discussions, the language used is far from technical. For example:
• a user describes paresthesia of the arm as "a tickling sensation in my arms";
• another patient, to describe her swollen abdomen, writes that she "looked six months pregnant";
• another user writes that "her mood is low" to describe her depression.
All these phrases, while expressed in layman's language, describe very specific symptoms. For this reason, we developed a set of annotation guidelines where the annotators were asked to mark any possible mention of an entity belonging to the seven categories above, even if not expressed in technical language. After running a pilot annotation task on a small set of discussions, we fine-tuned the annotation guidelines, and we asked PhD-qualified biomedical experts to annotate 4800 threads from the forum. After the annotation, the files were shuffled and split into training, test, and development sets of 8750, 2526, and 1250 sentences respectively. The number of annotations per category and per set is reported in Table 1.
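The shuffle-and-split step can be sketched as follows. The split sizes come from the text above; the random seed is an assumption, and the sketch shuffles at the sentence level for brevity, whereas we shuffled the annotated files.

```python
# Minimal sketch of the shuffle-and-split step; the per-split sentence
# counts match the paper, the seed is an arbitrary illustrative choice.
import random

def split_sentences(sentences, sizes=(8750, 2526, 1250), seed=42):
    """Shuffle annotated sentences and cut them into train/test/dev sets."""
    pool = list(sentences)
    random.Random(seed).shuffle(pool)
    train = pool[:sizes[0]]
    test = pool[sizes[0]:sizes[0] + sizes[1]]
    dev = pool[sizes[0] + sizes[1]:sizes[0] + sizes[1] + sizes[2]]
    return train, test, dev
```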

PsyTAR
The PsyTAR dataset (Zolnoori et al., 2019) "contains patients expression of effectiveness and adverse drug events associated with psychiatric medications". It is sourced from the Ask a Patient website and is annotated, among other things, for adverse drug reaction mentions, which we use both for NER and for the adverse reaction mention detection task.

CADEC
The CADEC corpus (Karimi et al., 2015) is a corpus of consumer reviews for pharmacovigilance. It is sourced from Ask a Patient too, and it is annotated for mentions of concepts such as drugs, adverse reactions, symptoms, and diseases, which are linked to SNOMED CT and MedDRA.

Embeddings
Using the dataset described in Section 3.1, we trained three word embedding models, namely GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018), and Flair (Akbik et al., 2018). We chose these models due to their popularity, performance, and relatively low resource requirements. In particular, GloVe requires just hours to train on a CPU, while ELMo and Flair obtained state-of-the-art results on the NER task at the time of their publication, and both models can be trained in a relatively short time (∼1 week) using 1 or 2 GPUs. As general-purpose and PubMed embeddings, we use the ones provided or recommended by the respective architectures' authors; unfortunately, we are not aware of any pre-trained GloVe PubMed embeddings available in the public domain. Using our BioReddit dataset, we trained all the embeddings with their default parameters, as described in their respective papers.
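Of the three models, GloVe is the simplest: it fits word vectors to a corpus-wide co-occurrence matrix. A minimal sketch of the counting step it starts from, with GloVe's 1/d distance weighting; the window size here is illustrative, not the value used in our training runs.

```python
# GloVe fits vectors to global co-occurrence statistics; this sketches only
# the counting step, using GloVe-style 1/d weighting by token distance.
from collections import defaultdict

def cooccurrence_counts(tokens, window=5):
    """Accumulate distance-weighted co-occurrence counts X[word][context]."""
    counts = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), i):
            c = tokens[j]
            weight = 1.0 / (i - j)   # closer context words count more
            counts[w][c] += weight
            counts[c][w] += weight   # symmetric window
    return counts
```

On the full 300-million-token corpus, the resulting matrix (restricted to the 780,000-word vocabulary) is what the GloVe objective is then optimized against.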

Named Entity Recognition
In order to evaluate our embeddings, we use Conditional Random Fields as a baseline, and then we evaluate our embeddings using a Bidirectional LSTM-CRF sequence tagging neural network (Lample et al., 2016). We refer the reader to the original paper for an explanation of how this architecture works, as the details are outside the scope of the present paper. We present our results in Table 2. As expected, all the neural architectures largely improve on the results obtained by the CRF and, in line with the literature, Flair performs slightly better than ELMo, which in turn performs better than GloVe. Using our purpose-built embeddings, called BioReddit in the table, we always obtain an improvement with respect to using embeddings trained on general-purpose data (Default in the table) or on PubMed, barring the smallest GloVe vectors. In Table 3 we provide a per-category breakdown of the best performing embeddings, i.e. Flair embeddings trained on our BioReddit corpus. It is interesting to note that the most difficult categories are Device and Phenotype. We explain these results by noting that the former is the least represented category in the corpus, while the latter was actually expected to be the hardest category. In fact, looking into the corpus, we found that users are relatively precise when talking about disease names, genes, molecules, and so on, while they do not necessarily describe their symptoms using "proper" medical language.
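Both the CRF baseline and the BiLSTM-CRF decode the output tag sequence with the Viterbi algorithm over emission and transition scores; a minimal sketch in log space follows. The toy scores in the usage example are illustrative, not taken from either trained model.

```python
# Viterbi decoding over log-space emission and transition scores, as used
# (with learned scores) by both the CRF baseline and the BiLSTM-CRF.

def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for one sentence.

    emissions: list of {tag: score} dicts, one per token;
    transitions: {(prev_tag, cur_tag): score}.
    """
    best = {t: emissions[0][t] for t in tags}  # best path score ending in t
    back = []                                  # backpointers per position
    for em in emissions[1:]:
        ptr, nxt = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions[(p, cur)])
            ptr[cur] = prev
            nxt[cur] = best[prev] + transitions[(prev, cur)] + em[cur]
        back.append(ptr)
        best = nxt
    last = max(tags, key=best.get)
    path = [last]
    for ptr in reversed(back):                 # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

For example, with tags `["O", "B"]`, a high emission score for `B` on the first token, and a transition bonus for `B → O`, the decoder prefers the sequence `["B", "O"]` even when per-token argmax alone would not enforce it.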
In Table 4 we report the results we obtain on the NER task on the PsyTAR and CADEC corpora using Flair embeddings, where BioReddit embeddings always outperform general-purpose and PubMed-trained ones. Interestingly, PubMed embeddings behave considerably worse than the others on the PsyTAR corpus, which seems to support the intuition that using a specialized scientific corpus does not always guarantee better performance.

Adverse Reaction Mention Detection
The task of Adverse Reaction Mention Detection (henceforth ADR) consists in detecting whether, in a sentence, a user mentions that they are experiencing or have experienced an adverse reaction to a drug. For this task, we designed a simple neural architecture, where a bidirectional GRU (Cho et al., 2014) reads a sentence, and a softmax layer on top performs the binary classification task of detecting whether the input sentence contains an ADR or not. When evaluating on the PsyTAR corpus, we again obtain the best performance when using our BioReddit embeddings, followed by the PubMed-trained ones and the default ones.
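The two ingredients of this architecture, a GRU update and a softmax read-out, can be sketched in scalar form. The weights here are single numbers for readability and only one direction is shown, whereas the actual model is bidirectional and operates on learned weight matrices over word embeddings.

```python
# Scalar sketch of one GRU step and the softmax read-out for the binary
# ADR decision; a toy illustration, not the trained model's parameters.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, w):
    """One GRU update for scalar state h and scalar input x."""
    z = sigmoid(w["wz"] * x + w["uz"] * h)                # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h)                # reset gate
    h_tilde = math.tanh(w["wh"] * x + w["uh"] * (r * h))  # candidate state
    return (1 - z) * h + z * h_tilde

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(inputs, w):
    """Run the (single-direction, scalar) GRU over a sentence and classify."""
    h = 0.0
    for x in inputs:
        h = gru_step(h, x, w)
    return softmax([w["out0"] * h, w["out1"] * h])  # [P(no ADR), P(ADR)]
```

In the full model, `inputs` would be the word embeddings of the sentence, the final states of the forward and backward GRUs would be concatenated, and the softmax would be applied to a linear projection of that concatenation.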

Conclusions
In this paper we showed how training ad-hoc embeddings for user-generated biomedical text processing improves the results on the tasks of named entity recognition and adverse reaction mention detection. While preliminary, our results give a strong indication that embeddings trained on biomedical scientific literature only are not guaranteed to be effective when used on user-generated data, since people use "layman terms" which are seldom, if ever, used in scientific literature. As future work, we acknowledge the need to investigate further the results we present here. A good starting point would be to analyze other embedding techniques, in order to investigate whether the performance improvement is due to the embedding techniques themselves or to the datasets used. Moreover, we need to analyze the performance of our BioReddit embeddings on non-user-generated content, e.g. scientific abstracts, in order to investigate whether they can perform effectively on that domain too. Finally, we think that a manual investigation of the results of the downstream tasks is important, to investigate e.g. whether the improvement in the ADR task is due to the embeddings helping to classify sentences with more colloquial language. Unfortunately, due to licensing and privacy issues, we are not allowed to release the HealthUnlocked corpus. However, we make our BioReddit embeddings trained with GloVe, ELMo, and Flair available at https://github.com/basaldella/bioreddit. For the sake of reproducibility, we also make our preprocessed PsyTAR splits available at https://github.com/basaldella/psytarpreprocessor.