A Framework for Developing and Evaluating Word Embeddings of Drug-named Entity

We investigate the quality of task-specific word embeddings created with relatively small, targeted corpora. We present a comprehensive evaluation framework, including both intrinsic and extrinsic evaluation, that can be extended to named entities beyond drug names. Intrinsic evaluation results show that drug name embeddings created with a domain-specific document corpus outperformed previously published embeddings derived from a very large general text corpus. The extrinsic evaluation uses the word embeddings for drug name recognition with a Bi-LSTM model, and the results demonstrate the advantage of using domain-specific word embeddings as the only input feature for drug name recognition, with the F1-score reaching 0.91. This work suggests that it may be advantageous to derive domain-specific embeddings for certain tasks even when the domain-specific corpus is of limited size.


Introduction
The ability of word embeddings to capture latent, contextual information has proven useful for a variety of NLP tasks, such as named entity recognition (Santos & Guimarães, 2015), syntactic parsing (Levy & Goldberg, 2014), and question answering (Iyyer et al., 2014). Within biomedical research, the word embeddings used in most previous studies were generated from very large, generic corpora (e.g. news articles). This is appropriate for generalized language models. However, for specialized domains and tasks, it may be beneficial to generate word embeddings from a targeted corpus. We propose a biomedical domain-specific word embedding model and a novel evaluation framework, both of which currently focus on representing drug names. The framework can be extended to other biomedical entities, such as protein, gene, and chemical compound names, in the future. We evaluate the developed word embeddings with a comprehensive intrinsic evaluation framework that includes relatedness, coherence, and outlier detection assessment, as well as an extrinsic evaluation that focuses on drug name recognition and classification with a bidirectional long short-term memory (Bi-LSTM) RNN model.

Related Work
In the biomedical domain, word embeddings are primarily used for biomedical named entity recognition (BNER), with evaluations conducted on tasks such as JNLPBA (Kim et al., 2004), BioCreAtIvE (Hirschman et al., 2005), and the BioNLP Shared Tasks. Tang et al. (2014) explored the impact on BNER of three types of word representations (WR): clustering-based representation, distributional representation, and word embedding. Segura-Bedmar et al. (2015) generated word embeddings with word2vec from a combined Wikipedia and MedLine corpus and evaluated the results on the SemEval-2013 Task 9.1 Drug Name Recognition dataset. Wang et al. (2015) used word embeddings for bio-event trigger detection. Li et al. (2015) combined word embedding features with bag-of-words (BOW) features for bio-event extraction and evaluated the results on the BioNLP 2013 GENIA task (Nédellec et al., 2013). Drug name recognition (DNR) in biomedical literature and clinical notes is essential for many medical information and relation extraction tasks (e.g. drug-drug interaction). Significant effort has been devoted to DNR, and the common methods can be categorized as follows (Lu et al., 2015): (1) dictionary-based approaches (Rindflesch et al., 2000; Sanchez-Cisneros et al., 2013), (2) rule-based/ontology-based approaches (Hamon & Grabar, 2010; Coden et al., 2012), (3) machine learning-based approaches (Lamurias et al., 2013; Lu et al., 2015), and (4) hybrid approaches (Korkontzelos et al., 2015).

Word Embeddings Training
We extracted text from PubMed and DrugBank to construct our corpus. For PubMed, we used "drug" as the query keyword to broadly select drug-related abstracts, which yielded 474,273 abstracts. From DrugBank¹ Release Version 5.0.5, we extracted the fields "description", "indication", "pharmacodynamics", "mechanism-of-action", and "toxicity" for 8,226 drugs. We employed the skip-gram model in word2vec to generate word embeddings. Moreover, because studies have found that word embeddings have a consistent relationship with word frequencies, even after the interception of frequency-based effects by algorithms and vector length normalization (Schnabel et al., 2015), we used correlation analysis between vectors and frequencies as the evaluation metric for tuning the parameters of the word embedding model. For our final result, we trained the word embedding model in word2vec with the parameters size = 420, window = 5, and min_count = 2.

Relatedness assessment
Relatedness evaluation is the most popular and direct intrinsic evaluation method for word embeddings. High-quality word embeddings are expected to display a significant correlation (e.g. Pearson's or Spearman's) between the cosine similarity of the embedding vectors for related word pairs and the corresponding human scores. We evaluated the results on two biomedical domain inventories: UMNSRS-Rel and UMNSRS-Sim (Pakhomov et al., 2010). These datasets provide human-annotated scores of relatedness and similarity between clinical term pairs. We measured the correlation between the scores provided by the UMNSRS datasets and those calculated by our model, using Spearman's correlation coefficient. We also compared our model to a publicly available word embedding set trained on about 100 billion words from Google News samples².
As shown in Tables 1 and 2, the scores from our model correlate positively with the UMNSRS scores in both the relatedness and the similarity assessment, with most of the correlation coefficients higher than 0.5, which means that the relationships represented in vector space are consistent with human annotations. In particular, the highest consistency is achieved for drug-drug pairs, where the coefficients reach 0.737 and 0.764 for relatedness and similarity, respectively. In addition, the proposed model trained on PubMed+DrugBank shows significantly higher correlations with human scores than the model trained on the Google News corpus for all word pair types. This is important because the Google News-based embeddings were trained on an extremely large dataset compared to our corpus.
¹ www.drugbank.ca/releases/latest
² https://code.google.com/archive/p/word2vec/
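The relatedness computation described in this section can be illustrated with a minimal sketch. It assumes that the embeddings are available as a plain dictionary from word to vector and that the annotated pairs are (word, word, human score) triples; the function names, the toy vectors, the toy scores, and the choice to skip out-of-vocabulary pairs are all illustrative assumptions rather than details taken from our pipeline.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def relatedness_correlation(embeddings, scored_pairs):
    """Correlate model cosine similarities with human scores."""
    model_scores, human_scores = [], []
    for w1, w2, human in scored_pairs:
        # Assumption: pairs with an out-of-vocabulary term are skipped.
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(human)
    return spearman(model_scores, human_scores)


# Toy data, invented for illustration only.
toy_embeddings = {
    "aspirin": [1.0, 0.0],
    "ibuprofen": [0.9, 0.1],
    "headache": [0.5, 0.5],
    "table": [0.0, 1.0],
}
toy_pairs = [
    ("aspirin", "ibuprofen", 900),
    ("aspirin", "headache", 700),
    ("aspirin", "table", 100),
]
print(relatedness_correlation(toy_embeddings, toy_pairs))
```

Because Spearman's coefficient depends only on ranks, the scale of the human scores does not matter, which is convenient when the annotation scale differs from the range of cosine similarity.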

Coherence assessment
Conceptually, we expect a good word embedding to be surrounded by a coherent neighborhood of similar words. Based on this concept, we propose a novel intrinsic evaluation metric as a supplement to current relatedness analysis (Schnabel et al., 2015). In coherence assessment, we assess whether a given word embedding is mutually related to the word embeddings in its local neighborhood. Here we created a neighborhood for each drug name and explored its relation to the closest neighbor terms. We expect other drug entities to be preferentially represented in the neighborhood. Varying the neighborhood size from 3 to 10, we calculated the percentage of drug names within the neighborhood of each drug, with selected results shown in Table 3.
Table 3 shows that the percentage of drug entities declines as the neighborhood size increases. Because neighbors were ordered by cosine similarity to the target word, this decline implies that drug entities tend to be the closest neighbors. Even so, drug entities still occupy more than half of the nearest 10 neighbors. These results suggest that there is strong coherence in the semantic space.
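The coherence measure can be sketched as follows. The sketch assumes a dictionary of embeddings and a set of known drug names, and it averages the per-drug percentages over all drugs; that averaging step, the function names, and the toy vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, embeddings, k):
    """Return the k words most similar to `word`, closest first."""
    target = embeddings[word]
    scored = [
        (cosine(target, vec), other)
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:k]]


def coherence(drug_names, embeddings, k):
    """Mean percentage of drug names among the k nearest neighbors of each drug."""
    percentages = []
    for drug in drug_names:
        if drug not in embeddings:
            continue  # Assumption: drugs without an embedding are ignored.
        neighbors = nearest_neighbors(drug, embeddings, k)
        hits = sum(1 for w in neighbors if w in drug_names)
        percentages.append(100.0 * hits / len(neighbors))
    return sum(percentages) / len(percentages)


# Toy vectors, invented for illustration only.
toy_embeddings = {
    "aspirin": [1.0, 0.1],
    "ibuprofen": [0.9, 0.2],
    "naproxen": [0.8, 0.3],
    "pain": [0.5, 0.6],
    "chair": [0.0, 1.0],
}
toy_drugs = {"aspirin", "ibuprofen", "naproxen"}

for k in (2, 3, 4):
    print(k, coherence(toy_drugs, toy_embeddings, k))
```

In this toy vocabulary the two other drugs are always the closest neighbors of each drug, so the percentage is highest for the smallest neighborhood and falls as non-drug words are admitted, mirroring the trend described for Table 3.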

Outlier Detection
As a final intrinsic measure of word embedding quality, we consider a modification of a previously proposed outlier detection task. Given a group of words W, the compactness score of a word w_m ∈ W represents the compactness of the cluster W \ {w_m}. Performance on the outlier detection task can be evaluated by accuracy and outlier position percentage (OPP) (Camacho-Collados & Navigli, 2016). Ideally, if the outliers in all groups were identified and listed at the last position, accuracy and OPP would be 1 and 100%, respectively. In this study, the goal of outlier detection is to identify the non-drug words as outliers. We created two datasets, each with 400 groups of words (|D| = 400). Following the work of Camacho-Collados and Navigli, each group in the first dataset, D-Manu, contains 4 to 8 drugs and 1 manually selected non-drug outlier (|W| ∈ [5,9]). Additionally, we modify the previously presented work by forming a second dataset, D-Rand, in which each group contains 4 to 8 drugs and 1 randomly selected non-drug outlier (|W| ∈ [5,9]). Tables 4 and 5 show the evaluation results of outlier detection on D-Rand and D-Manu. On D-Rand, outliers were identified in more than 40% of the groups across the different group sizes, and the OPP values indicate that the average outlier position was about 70% of the way toward the right end (100%) of the list ordered by compactness score. For D-Manu, the accuracy values are all higher than 0.8 and the OPP values are all above 93%.
To gain further insight into the potential correlation between outlier task performance and the distribution of similarities among the outlier and non-outlier terms, we calculated, for each group in D-Rand and D-Manu, the average similarity between each pair of non-outlier terms and the average similarity between the non-outliers and the outlier. We found that the average similarity between non-outliers was about 0.21. The average similarity between non-outliers and the outlier was about 0.16 for randomly selected outliers and about 0.12 for manually selected outliers. This result confirms that a greater distinction in word similarity is consistent with better accuracy in outlier detection.
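The scoring procedure described in this section can be sketched as follows. The sketch assumes that the compactness of a cluster is its mean pairwise cosine similarity and that words are ordered by ascending compactness score, so that the outlier should appear last; the position is normalized by |W| - 1 to obtain OPP. These definitions follow our reading of the description above and of Camacho-Collados and Navigli (2016), and the function names and toy vectors are illustrative assumptions.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compactness(words, embeddings):
    """Assumed definition: mean pairwise cosine similarity of a cluster."""
    pairs = list(combinations(words, 2))
    total = sum(cosine(embeddings[a], embeddings[b]) for a, b in pairs)
    return total / len(pairs)


def outlier_position(group, outlier, embeddings):
    """Index of the outlier when words are sorted by ascending compactness score.

    The score of a word is the compactness of the group without that word,
    so removing the true outlier should give the highest score (last position).
    """
    scores = {
        w: compactness([x for x in group if x != w], embeddings) for w in group
    }
    ranked = sorted(group, key=lambda w: scores[w])
    return ranked.index(outlier)


def evaluate(dataset, embeddings):
    """Return (accuracy, OPP in percent) over (group, outlier) pairs."""
    detected = 0
    position_sum = 0.0
    for group, outlier in dataset:
        position = outlier_position(group, outlier, embeddings)
        last = len(group) - 1
        detected += 1 if position == last else 0
        position_sum += position / last
    n = len(dataset)
    return detected / n, 100.0 * position_sum / n


# Toy vectors, invented for illustration only.
toy_embeddings = {
    "aspirin": [1.0, 0.1],
    "ibuprofen": [0.9, 0.2],
    "naproxen": [0.8, 0.3],
    "chair": [0.0, 1.0],
}
toy_dataset = [(["aspirin", "ibuprofen", "naproxen", "chair"], "chair")]
print(evaluate(toy_dataset, toy_embeddings))
```

Each group needs at least three words so that the cluster remaining after one word is removed still contains a pair; the groups in D-Manu and D-Rand, with 5 to 9 words, satisfy this.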

DNR with Bi-LSTM Model
We employ a bidirectional long short-term memory (Bi-LSTM) RNN model that is designed to process text input as a sequence of tokens (constituent parts, usually words) and to predict a label for each token. The Bi-LSTM model combines two RNNs: the forward RNN processes the sequence from left to right, and the backward RNN processes it from right to left. We use a BIO scheme for the sequence labeling task. Specifically, each token is labeled as one of B-X, I-X, or O, indicating that it is at the beginning (B), inside (I), or outside (O) of an entity of type X (e.g. drug name). To achieve the best results and to assess the impact of the word embedding model on the labeling task, we introduced three Bi-LSTM variants: (1) Fixed embedding (BLSTM-F): word embedding values were provided by the pre-trained word embedding model and treated as fixed constants; (2) Varied embedding (BLSTM-V): word embedding values were also provided by the pre-trained word embedding model but treated as learnable parameters; (3) Randomly initialized embedding (BLSTM-R): word embedding values were initialized randomly and treated as learnable parameters.
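As an illustration of the BIO scheme, the following sketch converts entity annotations into per-token labels. It assumes that entities are given as (start, end, type) token-index spans with an exclusive end and that spans do not overlap; the function name and the example sentence are hypothetical.

```python
def bio_labels(tokens, entities):
    """Convert (start, end, type) token spans into BIO labels.

    `end` is exclusive; spans are assumed not to overlap.
    """
    labels = ["O"] * len(tokens)
    for start, end, entity_type in entities:
        labels[start] = f"B-{entity_type}"  # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{entity_type}"  # remaining tokens of the entity
    return labels


# Hypothetical example sentence with two drug mentions.
tokens = ["Patients", "received", "acetylsalicylic", "acid", "and", "warfarin", "."]
entities = [(2, 4, "drug"), (5, 6, "drug")]

for token, label in zip(tokens, bio_labels(tokens, entities)):
    print(token, label)
```

The two-token mention "acetylsalicylic acid" receives B-drug followed by I-drug, while the single-token mention "warfarin" receives only B-drug; all other tokens are labeled O.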

Experiments on Drug Name Recognition
We evaluated our model on the DDI-Extraction-2011 task (Segura-Bedmar et al., 2011) using two metrics: exact matching, in which the predicted entity must have exactly the same boundaries as the annotated entity, and partial matching, in which the predicted entity must have some overlap with the annotated entity. Table 6 shows the results of the three BLSTM models. Regarding the impact of pre-trained word embeddings, there is no obvious improvement when pre-trained embedding values are used instead of randomly initialized vector values. Moreover, the F1-score rises from 0.891 for BLSTM-F, which treats the embedding values as fixed constants, to 0.911 for BLSTM-V, which sets them as learnable parameters in the RNN model. Overall, our BLSTM models achieve very good results on DNR according to the F1-scores, and treating the embedding values as learnable parameters, whether pre-trained or randomly initialized, leads to better results than keeping them fixed, indicating the great advantage of RNN models for the drug name recognition task.
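The difference between exact and partial matching can be illustrated with a short sketch. It assumes that entities are represented as (start, end) character offsets with an exclusive end, and it counts a prediction as correct if it matches any annotated entity under the chosen criterion; the official task scorer may count matches differently, and the function names and spans below are invented for illustration.

```python
def overlaps(a, b):
    """True if two (start, end) spans with exclusive ends share any position."""
    return a[0] < b[1] and b[0] < a[1]


def span_scores(gold, predicted, partial=False):
    """Return (precision, recall, F1) under exact or partial matching."""
    if partial:
        match = overlaps
    else:
        match = lambda a, b: a == b  # exact: identical boundaries
    correct_predictions = sum(
        1 for p in predicted if any(match(p, g) for g in gold)
    )
    recovered_gold = sum(
        1 for g in gold if any(match(g, p) for p in predicted)
    )
    precision = correct_predictions / len(predicted) if predicted else 0.0
    recall = recovered_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented spans: one exact hit, one boundary error, one spurious prediction.
gold = [(0, 7), (20, 38), (50, 58)]
predicted = [(0, 7), (20, 33), (70, 75)]

print("exact:  ", span_scores(gold, predicted))
print("partial:", span_scores(gold, predicted, partial=True))
```

The prediction (20, 33) overlaps the annotated span (20, 38) without reproducing its right boundary, so it counts only under partial matching; this is why partial scores are never lower than exact scores for the same predictions.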

Experiments on Drug Name Classification
In the DDI-Extraction-2013 challenge, drugs were annotated with four types instead of the single type used in the 2011 task: drug, brand, group, and drug_n. The task thus becomes one of drug name recognition and classification. We evaluated our results using the four metrics provided by the organizers, with the F1-scores shown in Table 7. Pre-trained word embeddings showed their advantage here; for instance, the F1-score for strict matching was 16% higher for BLSTM-V than for BLSTM-R. In contrast, a comparison of BLSTM-F and BLSTM-V shows that updating the pre-trained embedding values did not yield an obvious improvement.

Conclusion
We presented biomedical domain-specific word embeddings, generated with the word2vec model from PubMed and DrugBank text sources, and a comprehensive intrinsic and extrinsic evaluation framework for word embeddings that includes new and existing metrics. We found that our word embeddings demonstrated superior performance based on relatedness assessment, neighborhood coherence, and outlier detection. Moreover, we found that these embeddings performed better than those generated from very large datasets such as Google News. This is significant because our training dataset is approximately two orders of magnitude smaller. Since drug name recognition (DNR) is an important biomedical NLP task, we used DNR as the downstream task for the extrinsic evaluation of the developed drug name embeddings. We used the pre-trained word embeddings in a Bi-LSTM model for drug name recognition and classification. For drug name recognition, setting the embedding values as learnable parameters in the RNN model has more impact on performance than using pre-trained word embeddings. For drug name classification, pre-trained word embeddings offer significant performance increases over randomly initialized embeddings, while updating the pre-trained embedding values during BLSTM model training yields little improvement. This work provides a useful framework for processing raw biomedical text and extracting drug entities, which could be helpful in processing other unstructured data and medical entities.