Entity Extraction in Biomedical Corpora: An Approach to Evaluate Word Embedding Features with PSO based Feature Selection

Text mining has drawn significant attention in the recent past due to the rapid growth of biomedical and clinical records. Entity extraction is one of the fundamental components of biomedical text mining. In this paper, we propose a novel feature selection approach for entity extraction that exploits the concepts of deep learning and Particle Swarm Optimization (PSO). The system utilizes word embedding features along with several other features extracted by studying the properties of the datasets. We make the interesting observation that the compact word embedding features determined by PSO are more effective for entity extraction than the entire word embedding feature set. The proposed system is evaluated on three benchmark biomedical datasets, namely GENIA, GENETAG, and AiMed. The effectiveness of the proposed approach is evident from significant performance gains over the baseline models as well as other existing systems. We observe improvements of 7.86%, 5.27% and 7.25% F-measure points over the baseline models for the GENIA, GENETAG, and AiMed datasets, respectively.


Introduction
The tremendous amount of information accumulated in the domain of molecular biology has drawn the attention of the biomedical natural language processing (BioNLP) community, which aims to facilitate the development of tools for text processing applications and for the curation and organization of the ever-growing biomedical literature. Entity extraction is a crucial step in several pipelined applications such as information extraction, automatic summarization, question answering, and word sense disambiguation. Biomedical entities mostly refer to biological sequences of proteins and genes such as DNA, RNA, cell type, cell line etc. (Kim et al., 2004). The task of extracting this information from biomedical and clinical texts is referred to as entity extraction. An automatic system that can extract biomedical names such as gene, protein or disease names from text can substantially reduce human effort. However, extracting these entities from text poses several challenges, which are presented as follows:
1. Named entities are very generative in nature, i.e. many new names are continuously being generated. No dictionary can capture all the various forms of a given name.
cal text.
The research challenges have been addressed in the literature, including in shared-task challenges such as JNLPBA (Joint Workshop on Natural Language Processing in Biomedicine and its Applications) in 2004 (Kim et al., 2004) and the BioCreative (Critical Assessment for Information Extraction in Biology Challenge) II GM (gene mention) subtask in 2007 (Smith et al., 2008). Over the years, several benchmark corpora have been created that do not conform to uniform annotation guidelines. Therefore, a system developed by targeting a specific domain often fails to show reasonable accuracy when it is evaluated on other domains. In our work, we attempt to build a system for entity extraction that performs well across various biomedical corpora. Popular existing systems mostly rely on rule-based methods or supervised machine learning techniques to automatically extract entities. They treat the problem as one of sequence labeling and use algorithms such as hidden Markov models (HMM) (Zhao, 2004), support vector machines (SVM) (Kazama et al., 2002; GuoDong and Jian, 2004), maximum entropy Markov models (MEMM) (Finkel et al., 2005) and conditional random fields (CRF) (Ekbal et al., 2013; Settles, 2004; Kim et al., 2005). These supervised learning models are fully dependent on the features used for training. Some of the popular features used in existing studies include linguistic features such as morphological, syntactic and semantic information of words, and domain-specific features from biomedical ontologies such as Bio-Thesaurus (Liu et al., 2006) and UMLS (Unified Medical Language System) (Bodenreider, 2004). However, these features contribute heavily to the problem of data sparsity. In the recent past, there has been huge interest in using large unlabeled corpora to generate word representation features using deep neural network techniques. We are motivated by the strength of deep learning concepts to build our model.
We use the well-known word embedding model, which is a robust framework for incorporating word representation features (Mikolov et al., 2013b). A word representation feature is a mathematical description of a word in vector form. Each position of the vector corresponds to a feature with some semantic or grammatical interpretation, which leads to the term word feature. Word representation features contain latent syntactic/semantic information about a word. The main objective of using word embedding is to provide more useful information to the model being trained. Vector based word representation has the powerful capability of capturing the phenomenon that words having similar meanings should appear close together (Mikolov et al., 2013b). In traditional machine learning, data sparsity is a problem that often degrades performance. This drawback can be overcome by incorporating word embedding, under the presumption that semantically similar words appear in similar contexts (Mikolov et al., 2013b).
Our aim is to exploit the usefulness of neural network based word embedding (Bengio et al., 2003) as a feature for entity extraction in biomedical text. In addition, we make use of a very diverse feature set that exploits the properties of the data and problem-specific knowledge. We refrain from using much domain-specific information for feature extraction, keeping in view the easy adaptability of the system to more than one biomedical corpus.
However, the huge dimensionality of the word representation vector often contributes to the complexity of the system. This motivated us to apply a feature selection technique to reduce the dimensionality contributed by word embedding as well as to improve the system performance. Our algorithm for feature selection follows a wrapper based approach, which is formulated as an optimization problem. We use Particle Swarm Optimization (PSO) (Kennedy and Eberhart, 1997) as the underlying optimization strategy. Particle Swarm Optimization is an evolutionary technique inspired by the social behavior of birds. Some recent studies show that PSO converges faster than some other widely used optimization techniques (Bansal et al., 2011). Inspired by this observation, we use PSO in our current study. To analyze the effect of pruned word embedding, we have carried out an experiment with all the handcrafted features and the reduced features as determined by PSO. We perform experiments on three standard datasets, namely GENIA, GENETAG and AiMed. Evaluation results show that we achieve significant performance gains with the use of the pruned word embedding feature set. The best performance of the system was obtained when we applied the PSO based feature selection technique to the combination of the handcrafted feature set and the word embedding features. The key contributions of this paper are: (i) the proposal of a PSO based feature selection technique for biomedical entity extraction; (ii) an analysis of feature selection on word representation features alone; and (iii) a study of the impact of feature selection on word representation features combined with handcrafted features.

Related Works
There is a significant number of existing works on biomedical named entity recognition (BNER). These approaches can be divided into three major categories: (1) dictionary based, (2) rule based and (3) machine learning based techniques. Among these, machine learning based techniques have gained much more attention due to the availability of a sufficiently large amount of annotated corpora. For example, the majority of the systems submitted to the JNLPBA challenge made use of machine learning algorithms, which have been observed to significantly outperform the dictionary based methods. Some of the recent works in BNER include the unsupervised model proposed in (Zhang and Elhadad, 2013) and the system based on CRF (Li et al., 2015a). A two-phase approach based on semi-Markov CRF is proposed in (Yang and Zhou, 2014). In the first phase, the boundaries of entities are identified, while in the second phase semantic labeling is performed to label the detected entities. A CRF based system has been proposed by Tang et al. (2015), where in the first step the boundaries of NEs are identified and in the second step appropriate labels are assigned. Grouin (2014) performed experiments on the i2b2/VA-2010 challenge dataset to detect bacteria and biotopes names, and developed a model based on CRFs. An unsupervised approach that makes use of clustering based active learning is proposed in (Han et al., 2016); it uses the Shared Nearest Neighbor (SNN) clustering technique. In the work reported in (Li et al., 2015a), the authors proposed a parallel CRF algorithm (MapReduce CRF) that provides a mechanism to minimise the time taken for CRF learning. They showed that the proposed approach outperforms other traditional models in terms of time and efficiency. While most of the proposed systems used CRF, (Patra and Saha, 2013) proposed an entity extraction system based on SVM.
Particularly, they have introduced a tree kernel based function that can efficiently solve the full NER task. The work proposed in (Tohidi et al., 2014) aims to improve the performance of entity extraction using the statistical character-based syntax similarity (SCSS) algorithm. This algorithm computes the similarity between the identified candidate entities and a set of well-known NEs. This set of NEs is created by extracting the most frequently occurring NEs in the GENIA V3.0 corpus. In recent times, deep learning based approaches such as Recurrent Neural Networks and Bi-directional LSTMs have also been used for entity extraction (Li et al., 2015b; Limsopatham and Collier, 2016). It is well known that relevant features play an important role in building a highly accurate system. In our work, in addition to the standard features, we also use features extracted from the word embedding model. Bengio et al. (2003) proposed a neural network based model for the vector representation of words. The distributed representation (also known as word embedding) of a word has been used to improve the performance of various NLP tasks such as Part-of-Speech (POS) tagging, NER in the news-wire domain (Collobert et al., 2011), and parsing (Socher et al., 2013; Turian et al., 2010). Word clusters were used by Miller et al. (2004) to boost the performance of an NER system. Tang et al. (Tang et al., 2012; Tang et al., 2013) reported that the performance of biomedical entity extraction can be improved when word representation is used as a feature for CRF and SVM classifiers. Here we propose a PSO based feature selection technique that determines the most relevant features from the full word embedding set, and we use this subset as features for training the classifier. Feature selection has been widely used for many tasks such as gene expression (Ding and Peng, 2005), face recognition (Seal et al., 2015) and signal processing (Alamedine et al., 2013).
Dealing with biomedical text is, however, more difficult and challenging, as the features have non-numeric values and the texts are heavily unstructured. Except for a few works such as NER (Ekbal and Saha, 2016), co-reference resolution (Sikdar et al., 2015) and sentiment analysis (Gupta et al., 2015), systematic methods of feature selection using metaheuristic algorithms are very rare. In particular, the importance of using pruned neural language model based word representation features with effective feature selection has not been exploited so far in the literature.

A Brief Introduction to Particle Swarm Optimization (PSO)
Particle Swarm Optimization (PSO) is a metaheuristic technique inspired by the social behavior of a swarm striving for survival (Eberhart and Shi, 1998; Kennedy and Eberhart, 1997). It is a population based technique modeled on the way birds and fish search for the best path. In general, PSO consists of a swarm of particles, where each particle has its own position in the search space and moves around that space with some velocity. In each iteration, the particle selects the best path by using its memory and by learning from the effective path previously followed by the swarm. The new position is chosen on the basis of the knowledge gained from the particle's own best position and the best position of the swarm. PSO, being a metaheuristic, makes few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. This makes PSO highly efficient for optimization (Yan et al., 2013). The algorithm iterates by keeping track of two variables: the global best position, which represents the most promising vector found so far, and the personal best position, which denotes the particle's own best solution.
Algorithm: PSO based Feature Selection
1. Initially, we randomly set the swarm population. Each particle of the swarm is represented by a binary-valued feature vector of length n (the total number of features) and has a position and a velocity with which it moves in the search space. Mathematically, the particle position and the particle velocity are represented as:
$$\vec{P}(i) = \big(p(i,1), p(i,2), \ldots, p(i,n)\big), \qquad \vec{V}(i) = \big(v(i,1), v(i,2), \ldots, v(i,n)\big)$$
where $p(i,j) \in \{0, 1\}$, $i = 1, 2, \ldots, N$ and $j = 1, 2, \ldots, n$, and N is the number of particles. Each particle maintains the best position $\vec{B}(i)$ that it has achieved so far, as well as the global best position $\vec{G}$, i.e., the position of the particle having the best solution.
2. The value of the particle's position $\vec{P}(i)$ is set to either 0 or 1 on the basis of the following expression: [...]
3. [...] The memory is updated by keeping track of the best position and the global best position.
4. Initially, the value of the best position $\vec{B}(i)$ of every particle is set to 0. At every epoch (ep), the value of the best position is updated as follows: [...]
6. Originally, the velocity vector is generated randomly. At each iteration, the velocity of a particle is updated according to the following equation:
$$v(i,j) = \omega \, v(i,j) + \phi_1 \big(b(i,j) - p(i,j)\big) + \phi_2 \big(g(j) - p(i,j)\big)$$
where $\omega$ ($0 < \omega < 1$) is the inertia weight and $\phi_1$ and $\phi_2$ are the acceleration coefficients. These parameters are initialized with uniformly generated random numbers in the range (0, 1). Here $b(i,j)$, $p(i,j)$ and $g(j)$ denote the j-th components of $\vec{B}(i)$, $\vec{P}(i)$ and $\vec{G}$, respectively.
7. The position of a particle is updated by the following expression:
$$p(i,j) = \begin{cases} 1 & \text{if } r < \dfrac{1}{1 + e^{-v(i,j)}} \\ 0 & \text{otherwise} \end{cases}$$
where r is a uniformly generated random number in the range (0, 1) and $1/(1 + e^{-v(i,j)})$ is the sigmoid function. Thus, we set each component of the particle position to 0 or 1 on the basis of the value of the velocity.
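The steps of the algorithm can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the fitness function `toy_fitness`, the set `INFORMATIVE` and the default values of `omega`, `phi1` and `phi2` are hypothetical placeholders, and the sigmoid-threshold position rule is the standard binary PSO rule. In a wrapper setting, the fitness of a mask would instead be the F-measure of a classifier trained on the selected features. The defaults of 10 particles and 100 iterations follow the settings reported for the experiments.

```python
import math
import random


def sigmoid(v):
    """Logistic function that maps a velocity to a probability."""
    return 1.0 / (1.0 + math.exp(-v))


def binary_pso(fitness, n_features, n_particles=10, n_iterations=100,
               omega=0.7, phi1=0.5, phi2=0.5, seed=0):
    """Search for a 0/1 feature mask that maximizes `fitness`."""
    rng = random.Random(seed)
    # Step 1: random binary positions and random velocities.
    positions = [[rng.randint(0, 1) for _ in range(n_features)]
                 for _ in range(n_particles)]
    velocities = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
                  for _ in range(n_particles)]
    # Personal bests start as all-zero vectors, as in step 4 of the text.
    best_pos = [[0] * n_features for _ in range(n_particles)]
    best_fit = [float("-inf")] * n_particles
    global_pos, global_fit = [0] * n_features, float("-inf")

    for _ in range(n_iterations):
        # Evaluate every particle and update personal and global bests.
        for i in range(n_particles):
            score = fitness(positions[i])
            if score > best_fit[i]:
                best_fit[i], best_pos[i] = score, positions[i][:]
            if score > global_fit:
                global_fit, global_pos = score, positions[i][:]
        # Velocity update followed by the sigmoid-based position update.
        for i in range(n_particles):
            for j in range(n_features):
                velocities[i][j] = (
                    omega * velocities[i][j]
                    + phi1 * (best_pos[i][j] - positions[i][j])
                    + phi2 * (global_pos[j] - positions[i][j])
                )
                threshold = sigmoid(velocities[i][j])
                positions[i][j] = 1 if rng.random() < threshold else 0
    return global_pos, global_fit


# Hypothetical toy objective: dimensions 0, 2 and 5 are "useful",
# every other selected dimension costs half a point.
INFORMATIVE = {0, 2, 5}


def toy_fitness(mask):
    hits = sum(mask[j] for j in INFORMATIVE)
    extras = sum(mask) - hits
    return hits - 0.5 * extras


if __name__ == "__main__":
    mask, score = binary_pso(toy_fitness, n_features=8)
    print(mask, score)
```

The returned mask is the global best position: a 1 at index j means that feature j is kept.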

Learning Word Representations
Word embeddings (also known as distributed word representations) induce a real-valued latent semantic or syntactic vector for each word from a large unlabeled corpus by using continuous space language models (Tang et al., 2014). Better word representations can be obtained with a large amount of training data, as the resulting real-valued word vectors become more representative. We use the popular word2vec¹ tool proposed by Mikolov et al. (2013a) to extract the vector representations of words. Owing to its simpler architecture, which reduces the computational complexity, this technique can be used on large corpora. Two models have been proposed in (Mikolov et al., 2013a) to learn vector representations: the Continuous Bag-of-Words (CBOW) model and the Skip-gram model. Since the Skip-gram model is able to capture the semantic information of a word, we adopt it to train the model for vector representation. The Skip-gram architecture tries to maximize the classification of a word based on the other words in the same sentence. More formally, given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of the Skip-gram model is to maximize the average log probability
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)$$
where c is the window size. Here, we consider a few words that occur near biomedical entities: 'antigen', 'lymphocytes' and 'inhibited'. If we look at the most similar words for the word 'lymphocytes', we observe that apart from syntactically similar words like 'T-lymphocytes' and 'B-lymphocytes', the model is also able to capture words that are semantically similar, like 'CD3+', 'PBLs' and 'T-cells'.
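To make the Skip-gram objective concrete, the following sketch enumerates the (center, context) pairs that the double sum ranges over and evaluates the average log probability for a given probability function. The function names and the uniform probability used in the demonstration are illustrative assumptions; word2vec itself learns the probability from data rather than taking it as an input.

```python
import math


def skipgram_pairs(tokens, c=2):
    """List (center, context) pairs for all offsets -c <= j <= c, j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs


def average_log_prob(tokens, c, prob):
    """Average log probability, where prob(context, center) plays the
    role of p(w_{t+j} | w_t)."""
    total = sum(math.log(prob(context, center))
                for center, context in skipgram_pairs(tokens, c))
    return total / len(tokens)


if __name__ == "__main__":
    sentence = ["IL-2", "activates", "T", "lymphocytes"]
    print(skipgram_pairs(sentence, c=1))
    # Hypothetical uniform model over a 4-word vocabulary.
    print(average_log_prob(sentence, 1, lambda context, center: 0.25))
```

Tokens near the sentence boundary contribute fewer pairs because offsets that fall outside the sentence are skipped.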

Features for Entity Extraction
The features being extracted are described as follows:
1. Contextual feature: This is the local contextual feature, which refers to the tokens that appear within a window of 10 words, i.e. 5 to the left and 5 to the right of the current token.
2. Word prefixes and suffixes: These features refer to the fixed-length character sequences stripped from either the leftmost or the rightmost positions of the words.
3. Word length: It is observed that short words are rarely NEs. We define a binary-valued feature that takes the value 1 if the length of the current word is greater than a specified threshold. The threshold is set to 5 in this case.
4. Part-of-Speech (PoS) information: PoS provides useful syntactic evidence for detecting named entities (NEs). We use the PoS information of the current and/or the surrounding token(s) as features. The PoS information was extracted using the GENIA tagger² V2.0.2.
5. Chunk information: We use the GENIA tagger V2.0.2 to extract the chunk information. We employ the chunk information of the current and neighboring tokens as features.
6. Word shape: Word shape is defined as the mapping of each word to its equivalence class. In order to implement this feature, we normalize the words by converting every capital character to 'A', every small character to 'a' and every digit to '0'. After this conversion, we squeeze the consecutive characters into a single character. For example, if we consider the token 'Ly-49', the normalized word for this token would be 'Aa-00'.
7. Word class feature: This feature is based on the concept that entities present in the same class are mostly similar. Here, all capital letters are converted to 'A', small letters to 'a', numbers to 'O' and non-English characters to '-'. After this conversion, we squeeze the consecutive characters into a single character. For example, the word class feature for the token 'IL-2-mediated' is 'AA-O-aaaaaaaa', which is further reduced to 'A-O-a'.
8. Orthographic features: We use several orthographic features that consider capitalization and digit information. These features are: initial capital, all capital, capital in inner, initial capital then mix, only digits, digit with special character, initial digit then alphabet, digit in inner. It is observed that some symbols like (',', '-', '.', ' ') are very common in biomedical text. Some symbols like ',' are also very helpful for the identification of NE boundaries.

¹ https://code.google.com/p/word2vec/
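Several of the surface features listed above can be computed with short string functions. The sketch below is an illustration only: the function names and the maximum affix length are assumptions, since the text does not specify the affix lengths used. Because the word shape example ('Ly-49' mapped to 'Aa-00') keeps repeated characters, `word_shape` leaves squeezing optional, while `word_class` always squeezes, matching the 'IL-2-mediated' example.

```python
import re


def _map_chars(token, digit, other=None):
    """Map ASCII capitals to 'A', small letters to 'a', digits to `digit`.
    Any other character is kept, or replaced by `other` when given."""
    out = []
    for ch in token:
        if "A" <= ch <= "Z":
            out.append("A")
        elif "a" <= ch <= "z":
            out.append("a")
        elif "0" <= ch <= "9":
            out.append(digit)
        else:
            out.append(ch if other is None else other)
    return "".join(out)


def _squeeze(text):
    """Collapse each run of identical characters into one character."""
    return re.sub(r"(.)\1+", r"\1", text)


def word_shape(token, squeeze=False):
    shape = _map_chars(token, digit="0")
    return _squeeze(shape) if squeeze else shape


def word_class(token):
    return _squeeze(_map_chars(token, digit="O", other="-"))


def affixes(token, max_len=4):
    """Prefixes and suffixes of length 1..max_len (max_len is assumed)."""
    feats = {}
    for k in range(1, min(max_len, len(token)) + 1):
        feats[f"prefix{k}"] = token[:k]
        feats[f"suffix{k}"] = token[-k:]
    return feats


def length_flag(token, threshold=5):
    """1 if the token is longer than the threshold, else 0."""
    return 1 if len(token) > threshold else 0


if __name__ == "__main__":
    print(word_shape("Ly-49"), word_class("IL-2-mediated"))
    print(affixes("protein", 3), length_flag("kinase"))
```

The two helper mappings differ only in the digit symbol and in how non-alphanumeric characters are treated, which mirrors the difference between items 6 and 7.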

Methodology
The AiMed corpus was created using 20,000 sentences having gene/protein names extracted from the Database of Interacting Proteins (DIP). We use 7,500 labeled sentences for training and 2,500 sentences for validation. For evaluation, we use a test set consisting of 5,000 sentences. The GENETAG dataset is derived from the 'MedTag' dataset. Its training and test datasets comprise 118K and 142K words, respectively. In order to properly denote the boundaries of NEs, we use the IOB2⁸ encoding scheme. We evaluate our system in terms of recall, precision and F-measure. For evaluation, we use the script that was made available with the JNLPBA 2004 shared task⁹.
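As an illustration of the IOB2 scheme and of the evaluation measures, the sketch below encodes entity spans as IOB2 tags, decodes tags back into entities, and computes precision, recall and F-measure over exactly matching entities. It is a simplified stand-in and not the JNLPBA evaluation script; the function names and the span format (start, exclusive end, label) are assumptions made for this example.

```python
def to_iob2(n_tokens, spans):
    """Encode (start, end_exclusive, label) spans as IOB2 tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for k in range(start + 1, end):
            tags[k] = "I-" + label          # remaining tokens
    return tags


def entities(tags):
    """Decode IOB2 tags into a set of (start, end_exclusive, label)."""
    found, start, label = set(), None, None
    for k, tag in enumerate(list(tags) + ["O"]):  # sentinel closes last entity
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            found.add((start, k, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = k, tag[2:]
    return found


def precision_recall_f(gold, predicted):
    """Exact-match precision, recall and F-measure over entity sets."""
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


if __name__ == "__main__":
    gold_tags = to_iob2(5, [(0, 2, "protein"), (3, 4, "DNA")])
    pred_tags = ["B-protein", "I-protein", "O", "O", "O"]
    print(gold_tags)
    print(precision_recall_f(entities(gold_tags), entities(pred_tags)))
```

In this toy case one of two gold entities is found and nothing spurious is predicted, so precision is 1.0 and recall is 0.5.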

Baseline Models and Analysis
We start the experiments with the first baseline (i.e. Baseline-1) by developing a model trained with all the features discussed in Section-3. We evaluate the effect of word embedding features trained on various unlabeled datasets obtained from different text sources. In order to assess the effect of each trained word representation model, we add the word vector obtained from the respective model to the baseline feature set, one model at a time. To obtain the word embeddings, we use four different models trained on the unlabeled data extracted from PubMed¹⁰, PubMed Central Open Access (PMC OA)¹¹ and the latest English Wikipedia dump¹². Corpus statistics of PubMed and PMC OA are provided in the corresponding table.
We develop the second baseline (i.e. Baseline-2) by using the best word embedding model in combination with the handcrafted feature set. We further develop the third baseline, i.e. Baseline-3, by merging the word embedding feature set determined by PSO with the full handcrafted feature set. We observe that selecting relevant word embedding features improves performance over the whole word embedding feature set. We generate 200-dimensional word vectors using the following parameters¹³: the skip-gram model with a window size of 5, hierarchical soft-max training, and a frequent word sub-sampling threshold of 0.001. In order to make our proposed system generic, i.e. not biased towards any particular domain of data, we use the same PSO parameters in all our settings. We fine-tune the parameters ω, φ1 and φ2 by performing 3-fold cross validation experiments.
We keep the number of particles and the number of iterations at 10 and 100, respectively, throughout all the experiments.

¹³ We use the same parameters for training all four models.

Table (partially recovered): Corpus statistics (Pyysalo et al., 2013) of the openly available PubMed and PMC OA biomedical literature; PubMed abstracts for articles that are also present in PMC OA were discarded while creating the data. Surviving row fragments: "[...] 194,341 2,591,137,744" and "PubMed+PMC 22,792,858 229,810,015 5,487,486,225".
The effectiveness of PSO based feature selection is evident from the performance improvement shown in Table-5.

Comparison with Existing Feature Selection Techniques
Here we compare our PSO based feature selection technique with other existing feature selection techniques. We perform experiments with both filter and wrapper based models. For the univariate filter model, we use feature selection based on an information-theoretic concept, namely Information Gain, while for the multivariate filter model we use correlation based feature selection. Our results indicate that PSO performs better than the univariate model by 3.03% and the multivariate model by 2.60% F-measure points for the GENIA dataset. We observe quite similar behavior for the other two datasets.
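For reference, the Information Gain of a discrete feature with respect to the class labels is the label entropy minus the expected label entropy after conditioning on the feature. The sketch below computes it from scratch; the function names are illustrative, and it assumes discrete feature values, so real-valued embedding dimensions would first have to be discretized (a step the text does not describe).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) minus the weighted entropy of labels per feature value."""
    total = len(labels)
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    labels = ["NE", "NE", "O", "O"]
    print(information_gain([1, 1, 0, 0], labels))  # feature separates classes
    print(information_gain([1, 0, 1, 0], labels))  # feature is uninformative
```

A univariate filter would rank features by this score and keep the top-scoring ones, without training the classifier during selection, which is the main contrast with the wrapper approach.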
In addition, we explore two popular wrapper based feature selection techniques: Genetic Algorithm (GA) (Holland, 1975) based feature selection (Ekbal et al., 2010) and Recursive Feature Elimination (RFE) (Guyon et al., 2002).
We also show the evaluation of some existing approaches that make use of word representation features. An F-measure of 71.39% is reported in (Tang et al., 2014). Word representation features were also used in (Chang et al., 2015), which reported an F-measure of 71.77%.

Result and Discussion
We perform a statistical significance test (t-test) on the results obtained by our proposed model. For each dataset, experiments are executed for 10 independent runs and the t-statistic is used to analyze the obtained experimental results. Using the known distribution of the test statistic, the p-value is calculated. We observe that the p-values are less than 0.04 for all three datasets, which signifies that our results are statistically significant.
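The text does not state which variant of the t-test was used. As one possible reading, the sketch below computes Welch's two-sample t-statistic and its approximate degrees of freedom for two sets of per-run scores. The function name is an assumption and the numbers in the demonstration are arbitrary placeholders, not experimental results. Given SciPy, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb                  # squared standard error
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


if __name__ == "__main__":
    # Arbitrary placeholder values, used only to exercise the function.
    t, df = welch_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
    print(round(t, 4), round(df, 4))
```

The p-value is then obtained from the t distribution with `df` degrees of freedom; a paired test would be the natural alternative if both systems were run on identical splits.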

Error Analysis
Table 5: Comparisons (in terms of F-score) between the whole word embedding features using WE(4) and the PSO selected word embedding features, excluding handcrafted features. Here, W.E: Word embedding.

Here, we analyze the outputs obtained for each dataset in order to identify the possible errors. We categorize the errors in three ways as follows:
1. [...] mostly with the entities having long and compound word forms such as 'T cell activation-specific enchance'. We also observe that our system fails to correctly classify the instances that include brackets.
2. Incorrect entity type: This error occurs when the entity is properly identified but is assigned to some other entity class. This error is more prominent for the GENIA and GENETAG datasets. For the GENIA dataset, the classifier mostly confuses 'Protein' with 'Cell line' or 'Cell type'. In total, 126 Protein words are wrongly classified as either 'Cell line' or 'Cell type'. With the use of PSO, the number of mis-classifications was reduced to 97. In GENETAG, the majority of classes are predicted as 'I-NEWGENE'. This may be due to the fact that the majority of the instances belong to the 'I-NEWGENE' category. After applying PSO, we observe that the mis-classification of 'I-NEWGENE' is significantly reduced, from 325 to just 129.
3. Missed entity: Our system misses a significant number of NE instances. We find that the number of false negatives amounts to 1357, 155 and 40 for GENIA, AiMed and GENETAG, respectively. All these NEs are mis-classified as belonging to the other-than-NE category.

Conclusions & Future work
In this paper, we have investigated the effect of word embedding features in addition to handcrafted features for entity extraction from three benchmark biomedical datasets, namely GENIA, AiMed and GENETAG. We have evaluated the system using four different word representation schemes trained on texts extracted from PubMed, the PMC OA biomedical literature and Wikipedia dump datasets. In addition, we have performed PSO based feature selection on the whole feature set for the different datasets. We can conclude that using only the prominent features, instead of the full word representation feature set, can help improve the performance of the system. In future work, we would like to perform additional experiments to fine-tune the dimensions of the vectors and the parameters of CRF through cross-validation on the training set. The applicability of feature selection on word embedding features needs to be explored in other domains as well. In addition, we want to compare the performance of the representations obtained through word2vec with others such as GloVe. We would also like to explore deep learning techniques as a replacement for CRF.