Recurrent neural network models for disease name recognition using domain invariant features

Hand-crafted features based on linguistic and domain knowledge play a crucial role in determining the performance of disease name recognition systems. Such methods are further limited by the scope of these features, in other words, by their ability to cover the contexts or word dependencies within a sentence. In this work, we focus on reducing such dependencies and propose a domain-invariant framework for the disease name recognition task. In particular, we propose various end-to-end recurrent neural network (RNN) models for the tasks of disease name recognition and their classification into four pre-defined categories. We also utilize a convolutional neural network (CNN) in cascade with the RNN to obtain character-based embedded features and employ them together with word-embedded features in our model. We compare our models with the state-of-the-art results for the two tasks on the NCBI disease dataset. Our results for the disease mention recognition task indicate that state-of-the-art performance can be obtained without relying on feature engineering. Further, the proposed models obtained improved performance on the classification task of disease names.


Introduction
Automatic recognition of disease names in biomedical and clinical texts is of utmost importance for the development of more sophisticated NLP systems such as information extraction, question answering, text summarization and so on (Rosario and Hearst, 2004). Complicated and inconsistent terminologies, ambiguities caused by the use of abbreviations and acronyms, new disease names, multiple names (possibly with a varying number of words) for the same disease, and complicated syntactic structures referring to multiple related names or entities are some of the major reasons making automatic identification of diseases difficult and challenging (Leaman et al., 2009). State-of-the-art disease name recognition systems (Mahbub Chowdhury and Lavelli, 2010; Dogan and Lu, 2012; Dogan et al., 2014) depend on user-defined features which in turn try to capture context keeping in mind the above-mentioned challenges. Feature engineering not only requires linguistic as well as domain insight but is also time consuming and corpus dependent.
Recently, the window based neural network approach of (Collobert and Weston, 2008; Collobert et al., 2011) has received a lot of attention for different sequence tagging tasks in NLP. It gave state-of-the-art results in many sequence labeling problems without using many hand-designed or manually engineered features. One major drawback of this approach is its inability to capture features from outside the window. Consider the sentence "Given that the skin of these adult mice also exhibits signs of de novo hair-follicle morphogenesis, we wondered whether human pilomatricomas might originate from hair matrix cells and whether they might possess beta-catenin-stabilizing mutations" (taken verbatim from PMID: 10192393). Words such as signs and originate, appearing on both sides of the word "pilomatricomas", play an important role in deciding that it is a disease. Any model relying on features defined over words occurring within a fixed window of neighbors will fail to capture information from influential words occurring outside this window.
Our motivation can be summarized in the following question: can we identify disease names and categorize them without relying on feature engineering, domain knowledge or task-specific resources? In other words, this work is motivated towards mitigating two issues: first, feature engineering relying on linguistic and domain-specific knowledge; and second, bringing flexibility in capturing influential words affecting model decisions irrespective of their occurrence anywhere within the sentence. For the first, we use character-based embeddings (likely to capture orthographic and morphological features) as well as word embeddings (likely to capture lexico-semantic features) as features of the neural network models.
For the second issue, we explore various recurrent neural network (RNN) architectures for their ability to capture long distance contexts. We experiment with the bidirectional RNN (Bi-RNN), bidirectional long short-term memory network (Bi-LSTM) and bidirectional gated recurrent unit (Bi-GRU). In each of these models we use a sentence-level log-likelihood approach at the top layer of the neural architecture. The main contributions of this work can be summarized as follows:

• Domain-invariant features with various RNN architectures for the disease name recognition and classification tasks,
• A comparative study on the use of character-based embedded features, word embedding features and combined features in the RNN models,
• A failure analysis to check where exactly our models fail in the considered tasks.
Although there are some related works (discussed in section 6), this is the first study, to the best of our knowledge, which comprehensively uses various RNN architectures without resorting to feature engineering for the disease name recognition and classification tasks.
Our results show that near state-of-the-art performance can be achieved on the disease name recognition task. More significantly, the proposed models obtain significantly improved performance on the disease name classification task.

Methods
We first give an overview of the complete model used for the two tasks. Next we explain the embedded features used in the different neural network models. We provide a short description of the different RNN models in section 2.3. Training and inference strategies are explained in section 2.4.

Model Architectures
Similar to any named entity recognition task, we formulate the disease mention recognition task as a token-level sequence tagging problem. Each word has to be labeled with one of the defined tags. We choose the BIO scheme of tagging, where B stands for beginning, I for intermediate and O for outsider or other. This way we have two possible tags for all entities of interest, i.e., for all disease mentions, and one tag for other entities.
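As an illustration of the BIO scheme (a minimal sketch, not the authors' code; the function name `bio_tags` and the tag labels `B-Dis`/`I-Dis` are illustrative choices), a multi-word disease mention such as "breast cancer" is tagged like this:

```python
def bio_tags(tokens, mention):
    """Assign BIO tags to tokens, marking occurrences of a multi-word mention.

    B-Dis marks the first word of a disease mention, I-Dis the following
    words of the same mention, and O everything else.
    """
    tags = ["O"] * len(tokens)
    n = len(mention)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == mention:
            tags[i] = "B-Dis"
            for j in range(i + 1, i + n):
                tags[j] = "I-Dis"
    return tags

tokens = ["inherited", "breast", "cancer", "is", "studied"]
print(bio_tags(tokens, ["breast", "cancer"]))
# ['O', 'B-Dis', 'I-Dis', 'O', 'O']
```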
The generic neural architecture is shown in Figure 1. In the very first step, each word is mapped to its embedded features. We call this layer the embedding layer. This layer acts as input to the hidden layers of the RNN model. We study three different RNN models, described briefly in section 2.3. The output of the hidden layers is then fed to the output layer to compute the scores for all tags of interest (Collobert et al., 2011; Huang et al., 2015). In the output layer we use sentence-level log-likelihood to make inference. Table 1 briefly describes all notation used in the paper.

Features

Distributed Word Representation (WE)
Distributed word representation, or word embedding, or simply word vector (Bengio et al., 2003; Collobert and Weston, 2008) is a low-dimensional real-valued vector representation of a word learned from an unlabeled corpus. Word vectors are present in the columns of the matrix M^we. We can get the vector of the i-th vocabulary word by taking the product of the matrix M^we and its one-hot vector:

w^(i) = M^we h^(i)

Here h^(i) is the one-hot vector representation of the i-th word in V. We use pre-trained 50-dimensional word vectors learned using the skip-gram method on a biomedical corpus (Mikolov et al., 2013b; Mikolov et al., 2013a; TH et al., 2015).

Table 1: Notation. M^we: word embedding matrix whose columns hold the word vectors; M^cw: character embedding matrix, where every column M^cw_i is the vector representation of the corresponding character c_i in C; x^(i): feature vector of word w^(i), obtained by concatenating its word embedding w^(i) and character-level embedding y^(i); z^(i): score vector for the i-th word of the sentence at the output layer of the neural network, whose j-th element indicates the score for the tag t_j; W**, U**, V**: parameters of the different neural networks.
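The product of the embedding matrix with a one-hot vector is simply a column lookup. A minimal numpy sketch (toy vocabulary and dimensions; the paper uses 50-dimensional vectors over a large biomedical vocabulary):

```python
import numpy as np

rng = np.random.default_rng(0)
V = ["cancer", "gene", "mutation"]          # toy vocabulary
d_we = 4                                    # 50 in the paper; small here
M_we = rng.standard_normal((d_we, len(V)))  # word vectors sit in the columns

def word_vector(word):
    """w = M_we @ h, where h is the one-hot vector of the word."""
    h = np.zeros(len(V))
    h[V.index(word)] = 1.0
    return M_we @ h

# The matrix-vector product just selects the word's column of M_we.
assert np.allclose(word_vector("gene"), M_we[:, 1])
```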

Character Level Word Embedding (CE)
Word embeddings preserve syntactic and semantic information well but fail to capture morphological and shape information. However, for the disease entity recognition task, such information can play an important role. For instance, the infix -o- in the word gastroenteritis combines various body parts: gastro for stomach, enter for intestines, and itis indicates inflammation. Hence, taken together, it implies inflammation of the stomach and intestines, where -itis plays a significant role in determining that it is actually a disease name.
Character-level word embedding was first introduced by (dos Santos and Zadrozny, 2014) with the motivation to capture word shape and morphological features in word embeddings. Character-level word embedding also automatically mitigates the problem of out-of-vocabulary words, as we can embed any word through its characters. In this case, a vector is initialized for every character in the corpus. Then we learn a vector representation for any word by applying a CNN to the vectors of the character sequence of that word, as shown in Figure 2. These character vectors are updated only while training the RNN in a supervised manner. Since the number of characters in the dataset is not high, we assume that every character vector will get sufficient updates during RNN training itself. Let {p_1, c_1, c_2 ... c_M, p_2} be the sequence of characters for a word with padding at the beginning and end of the word, and let {a_l, a_1, a_2 ... a_M, a_r} be its sequence of character vectors, which we obtain by multiplying M^cw with the one-hot vector of the corresponding character. To obtain the character-level word embedding, we feed this sequence to a convolutional neural network (CNN) with a max pooling layer (dos Santos and Zadrozny, 2014):

y = max_{1 ≤ m ≤ M} [ W q^(m) + b ]

Here k is the window size, and q^(m) is obtained by concatenating the vectors of the (k−1)/2 characters to the left through the (k−1)/2 characters to the right of c_m. The same filter W is used for all windows of characters, and the max pooling operation is performed element-wise over the results of all windows. We learn 100-dimensional character embeddings for all characters in a given dataset (avoiding case sensitivity) and 25-dimensional character-level word embeddings from the character sequences of words.
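The window-concatenate-convolve-pool pipeline above can be sketched in a few lines of numpy (toy dimensions; the paper uses 100-dimensional character vectors and 25-dimensional character-level word embeddings, and '#' standing in for the padding symbols p_1, p_2 is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
chars = "abcdefghijklmnopqrstuvwxyz#"        # '#' plays the role of padding
d_chr, d_cw, k = 5, 3, 3                     # char dim, output dim, window size
M_cw = rng.standard_normal((d_chr, len(chars)))  # character vectors in columns
W = rng.standard_normal((d_cw, k * d_chr))       # shared convolution filter
b = rng.standard_normal(d_cw)

def char_word_embedding(word):
    """Slide a shared filter over character windows, then max-pool element-wise."""
    padded = "#" + word + "#"                              # p_1 ... p_2
    vecs = [M_cw[:, chars.index(c)] for c in padded]
    # q^(m): concatenation of the k character vectors centered at position m
    windows = [np.concatenate(vecs[m:m + k]) for m in range(len(vecs) - k + 1)]
    return np.max([W @ q + b for q in windows], axis=0)    # element-wise max

y = char_word_embedding("itis")
assert y.shape == (d_cw,)
```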

Recurrent Neural Network Models
Recurrent Neural Network (RNN) is a class of artificial neural networks which utilizes sequential information and maintains history through its intermediate layers (Graves et al., 2009; Graves, 2013). We experiment with three different variants of RNN, which are briefly described in the subsequent subsections.

Bi-directional Recurrent Neural Network
In a Bi-RNN, the context of a word is captured through past and future words. This is achieved by having two hidden components in the intermediate layer, as schematically shown in Figure 1. One component processes the information in the forward direction (left to right) and the other in the reverse direction. The outputs of these components are then concatenated and fed to the output layer to get the scores for all tags of the considered word. Let x^(t) be the feature vector of the t-th word in the sentence (the concatenation of the corresponding embedding features w^(t) and y^(t)) and h_l^(t−1) the hidden state computed at the (t−1)-th word; then the hidden and output layer values are computed as:

h_l^(t) = tanh(U_l x^(t) + W_l h_l^(t−1))
h_r^(t) = tanh(U_r x^(t) + W_r h_r^(t+1))
z^(t) = V [h_l^(t) ; h_r^(t)]

where [h_l^(t) ; h_r^(t)] denotes the concatenation of the forward and backward hidden states.
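A minimal numpy sketch of this forward/backward pass (toy dimensions; parameter names follow the notation above, random initialization stands in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(2)
d_x, n_H, n_O = 4, 3, 2   # input, hidden, and output (tag) sizes
Ul, Wl = rng.standard_normal((n_H, d_x)), rng.standard_normal((n_H, n_H))
Ur, Wr = rng.standard_normal((n_H, d_x)), rng.standard_normal((n_H, n_H))
V = rng.standard_normal((n_O, 2 * n_H))   # scores from concatenated states

def bi_rnn(xs):
    """Run forward and backward passes, concatenate, and score each word."""
    hl, hr = np.zeros(n_H), np.zeros(n_H)
    left, right = [], []
    for x in xs:                        # left to right
        hl = np.tanh(Ul @ x + Wl @ hl)
        left.append(hl)
    for x in reversed(xs):              # right to left
        hr = np.tanh(Ur @ x + Wr @ hr)
        right.append(hr)
    right.reverse()                     # align backward states with positions
    return [V @ np.concatenate([l, r]) for l, r in zip(left, right)]

zs = bi_rnn([rng.standard_normal(d_x) for _ in range(5)])
assert len(zs) == 5 and zs[0].shape == (n_O,)
```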

Bi-directional Long Short Term Memory Network
Traditional RNN models suffer from both vanishing and exploding gradients (Pascanu et al., 2012; Bengio et al., 2013). Such models are likely to fail where longer contexts are needed to do the job. These issues were the main motivation behind the LSTM model (Hochreiter and Schmidhuber, 1997). An LSTM layer is just another way to compute a hidden state, which introduces a new structure called a memory cell (c_t) and three gates called the input (i_t), output (o_t) and forget (f_t) gates.
These gates are composed of a sigmoid activation function and are responsible for regulating the information in the memory cell. The input gate, by allowing the incoming signal to alter the state of the memory cell, regulates the proportion of history information the memory cell keeps. On the other hand, the output gate regulates what proportion of the stored information in the memory cell will influence other neurons. Finally, the forget gate modulates the memory cell, allowing the cell to remember or forget its previous state. The memory cell (c^(t)) is computed from the previous memory cell and the candidate hidden state (g^(t)), which we compute from the current input and the previous hidden state. The final output of the hidden state is calculated based on the memory cell and the output gate.
In our experiments we used the model discussed in (Graves, 2013; Huang et al., 2015). Let x^(t) be the feature vector for the t-th word in a sentence and h_l^(t−1) the previous hidden state; then the hidden state (h_l^(t)) of the LSTM is computed as:

i^(t) = σ(U_i x^(t) + W_i h_l^(t−1))
f^(t) = σ(U_f x^(t) + W_f h_l^(t−1))
o^(t) = σ(U_o x^(t) + W_o h_l^(t−1))
g^(t) = tanh(U_g x^(t) + W_g h_l^(t−1))
c^(t) = f^(t) * c^(t−1) + i^(t) * g^(t)
h_l^(t) = o^(t) * tanh(c^(t))

where σ is the sigmoid activation function, * is the element-wise product, n_I is the input size (d_we + d_ce) and n_H is the hidden layer size. We compute h_r^(t) in a similar manner as h_l^(t) by reversing all the words of the sentence. Let V be the parameter of the output layer of the LSTM; then the output layer is computed as:

z^(t) = V [h_l^(t) ; h_r^(t)]
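One LSTM step, written out in numpy (a sketch with toy dimensions and random weights; the per-gate parameter dictionaries are an organizational choice of this sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(3)
d_x, n_H = 4, 3
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
# One (U, W) pair per gate: input (i), forget (f), output (o), candidate (g).
U = {g: rng.standard_normal((n_H, d_x)) for g in "ifog"}
W = {g: rng.standard_normal((n_H, n_H)) for g in "ifog"}

def lstm_step(x, h_prev, c_prev):
    """One LSTM step: gates, memory-cell update, then the hidden state."""
    i = sigma(U["i"] @ x + W["i"] @ h_prev)     # input gate
    f = sigma(U["f"] @ x + W["f"] @ h_prev)     # forget gate
    o = sigma(U["o"] @ x + W["o"] @ h_prev)     # output gate
    g = np.tanh(U["g"] @ x + W["g"] @ h_prev)   # candidate state
    c = f * c_prev + i * g                      # memory cell
    h = o * np.tanh(c)                          # hidden state
    return h, c

h, c = lstm_step(rng.standard_normal(d_x), np.zeros(n_H), np.zeros(n_H))
assert h.shape == (n_H,) and c.shape == (n_H,)
```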

Bi-directional Gated Recurrent Unit Network
A gated recurrent unit (GRU) was proposed by (Cho et al., 2014) to make each recurrent unit adaptively capture dependencies of different time scales. Similar to the LSTM unit, the GRU has gating units, a reset gate r and an update gate u, that modulate the flow of information inside the unit, however without having separate memory cells. The resulting model is simpler than standard LSTM models.
We follow the (Chung et al., 2014) model of the GRU to transform the extracted word embedding and character embedding features into scores for all tags. Let x^(t) be the embedding feature vector for the t-th word in the sentence and h_l^(t−1) the hidden state computed at the (t−1)-th word; then the GRU computation is:

u^(t) = σ(U_u x^(t) + W_u h_l^(t−1))
r^(t) = σ(U_r x^(t) + W_r h_l^(t−1))
g^(t) = tanh(U_g x^(t) + W_g (r^(t) * h_l^(t−1)))
h_l^(t) = (1 − u^(t)) * h_l^(t−1) + u^(t) * g^(t)

where * is the element-wise product. h_r^(t) is computed in a similar manner as h_l^(t) by reversing the words of the sentence, and the output scores are obtained as z^(t) = V [h_l^(t) ; h_r^(t)].
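One GRU step in numpy (a sketch with toy dimensions; `u` is used for the update gate here to keep `z` free for the output scores, and the parameter layout mirrors the LSTM sketch above rather than any particular implementation):

```python
import numpy as np

rng = np.random.default_rng(4)
d_x, n_H = 4, 3
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
# One (U, W) pair per gate: update (u), reset (r), candidate (g).
U = {g: rng.standard_normal((n_H, d_x)) for g in "urg"}
W = {g: rng.standard_normal((n_H, n_H)) for g in "urg"}

def gru_step(x, h_prev):
    """One GRU step: update/reset gates blend old and candidate states."""
    u = sigma(U["u"] @ x + W["u"] @ h_prev)           # update gate
    r = sigma(U["r"] @ x + W["r"] @ h_prev)           # reset gate
    g = np.tanh(U["g"] @ x + W["g"] @ (r * h_prev))   # candidate state
    return (1.0 - u) * h_prev + u * g                 # no separate memory cell

h = gru_step(rng.standard_normal(d_x), np.zeros(n_H))
assert h.shape == (n_H,)
```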

Training and Inference
Equations 3, 4 and 5 give the scores of all possible tags for the t-th word of a sentence. We follow the sentence-level log-likelihood (SLL) (Collobert et al., 2011) approach, equivalent to a linear-chain CRF, to infer the score of a particular tag sequence for the given word sequence. Let [w]_1^|s| be the sentence and [t]_1^|s| the tag sequence for which we want to find the joint score; then the score for the whole sentence with that particular tag sequence is:

s([w]_1^|s|, [t]_1^|s|) = Σ_{i=1}^{|s|} ( W^trans_{t_{i−1}, t_i} + z^(i)_{t_i} )

where W^trans is the transition score matrix, with W^trans_{i,j} indicating the score of moving from tag t_i to tag t_j; t_j is the tag of the j-th word; and z^(i)_{t_i} is the output score from the neural network model for the tag t_i of the i-th word. To train our model we used the cross-entropy loss function and the AdaGrad (Duchi et al., 2010) approach to optimize it. All neural network parameters, word embeddings, character embeddings and W^trans (the transition score matrix used in the SLL) were updated during training. The entire code has been implemented using the Theano (Bastien et al., 2012) library in Python.
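The sentence score combines per-word emission scores with tag-to-tag transition scores, linear-chain CRF style. A minimal sketch (toy numbers; the first word's transition term is omitted here, a simplifying assumption rather than the paper's exact treatment of the start tag):

```python
import numpy as np

def sequence_score(z, W_trans, tags):
    """Score of one tag sequence: emission scores from the network plus
    transition scores between consecutive tags."""
    score = z[0][tags[0]]
    for i in range(1, len(tags)):
        score += W_trans[tags[i - 1], tags[i]] + z[i][tags[i]]
    return score

z = np.array([[1.0, 0.0], [0.5, 2.0], [0.2, 0.3]])  # 3 words, 2 tags
W_trans = np.array([[0.1, 0.4], [0.0, 0.2]])
print(round(sequence_score(z, W_trans, [0, 1, 1]), 2))
# 3.9  (= 1.0 + (0.4 + 2.0) + (0.2 + 0.3))
```

At training time the model maximizes the log-likelihood of the gold sequence against the log-sum-exp over all sequences; at inference, Viterbi decoding picks the highest-scoring sequence.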

Dataset
We used the NCBI dataset (Dogan and Lu, 2012), the most comprehensive publicly available dataset annotated with disease mentions, in this work. The NCBI dataset has been manually annotated by a group of medical practitioners to identify diseases and their types in biomedical articles. All disease mentions were categorized into four different categories, namely specific disease, disease class, composite mention and modifier. A word is annotated as specific disease if it indicates a particular disease. The disease class category indicates a word describing a family of many specific diseases, such as autoimmune disorder. A string signifying two or more different disease mentions is annotated as a composite mention. The modifier category indicates that a disease mention has been used as a modifier for other concepts. This dataset is an extension of the AZDC dataset (Leaman et al., 2009).

Results and Discussion
Evaluation of different models using CE
We first evaluate the performance of the different RNNs using only character embedding features. We compare the results of the RNN models with the window based neural network (Collobert et al., 2011) using the sentence-level log-likelihood approach (NN + CE). For the window based neural network, we considered a window size of 5 (two words on each of the left and right, and one central word) and used the same character embedding settings as features. The same set of parameters is used in all experiments unless we specifically mention otherwise. We used an exact matching scheme to evaluate the performance of all models.
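Under exact matching, a predicted mention counts as correct only if its span coincides exactly with a gold span; partial overlaps score nothing. A minimal sketch of this metric (the span-tuple representation and function name are illustrative, not the paper's evaluation script):

```python
def exact_match_f1(gold, pred):
    """Precision, recall and F1 where a prediction counts only if its
    (start, end) span matches a gold span exactly."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = [(3, 5), (10, 12)]   # two gold mentions as token spans
pred = [(3, 5), (11, 12)]   # second prediction overlaps but is not exact
p, r, f1 = exact_match_f1(gold, pred)
print(p, r, f1)
# 0.5 0.5 0.5
```

This strictness is exactly why the nested-mention errors discussed in the failure analysis register as false positives.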
Table 3 shows the results obtained by the different RNN models with only character-level word embedding features. For task A (disease name recognition), the Bi-LSTM and NN models gave competitive performance on the test set, while Bi-RNN and Bi-GRU did not perform so well. On the other hand, for task B, the RNN models show a 2.08%-3.8% improvement in performance (F1-score) over the NN model, again on the test set. The Bi-LSTM model obtained an F1-score of 59.78% while the NN model gave 57.56%. As discussed earlier, task B is more difficult than task A, as the disease category is more likely to be influenced by words falling outside the context window considered in window based methods. This could be the reason why the RNN models perform better than the NN model. This hypothesis would be strengthened if we observe a similar pattern in our other experiments.

Evaluation of different models with WE and WE+CE
Next we investigated the results obtained by the various models using only 50-dimensional word embedding features. The first part of Table 4 shows the results obtained by the different RNNs and the window based neural network (NN). In this case the RNN models give better results than the NN model for both tasks. In particular, the Bi-LSTM model performs best in both tasks. We observe that for task A, the RNN models obtained a 1.2% to 3% improvement in F1-score over the baseline NN performance. Similarly, a 2.55% to 4% improvement in F1-score is observed for task B, with the Bi-LSTM model obtaining more than 4% improvement.
In the second part of this table we compare the results obtained by the various models using the feature set obtained by combining the two feature sets. Looking at the performance of each individual model across the three different sets of features, the models using only word embedding features seem to give the consistently best performance. Among all models, Bi-LSTM using word embedding features obtained the best F1-scores of 79.13% and 63.16% for tasks A and B respectively.

Importance of tuning pre-trained word vectors
We further empirically evaluated the importance of updating the pre-trained word vectors during training. For this, we performed another set of experiments in which the pre-trained word vectors are not updated while training. Finally, we compare our results with the state-of-the-art results reported in (Dogan and Lu, 2012) on this dataset using BANNER (Leaman and Gonzalez, 2008). Although the result reported in (Dogan and Lu, 2012) (F1-score = 81.8%) is better than that of our RNN models, it should be noted that a competitive result (F1-score = 79.13%) is obtained by the proposed Bi-LSTM model, which does not depend on any feature engineering or domain-specific resources and uses only word embedding features trained in an unsupervised manner on a huge corpus.
For task B, we did not find any prior work except (Li, 2012). Li (2012) used a linear soft-margin support vector machine (SVM) with a number of hand-designed features including dictionary based features. The best performing proposed model shows more than a 37% relative improvement in F1-score (benchmark: 46% vs Bi-LSTM+WE: 63.16%).

Failure Analysis
To see where exactly our models fail to recognize diseases, we analyzed the results carefully. We found that a significant proportion of errors arise from acronyms of diseases and from disease name forms that appear rarely in our corpus; examples of such cases are "CD", "HNPCC" and "SCA1". We observe that these errors occur because we do not have exact word embeddings for these words: most of the acronyms in the disease corpus were mapped to the rare-word embedding.1 Another major proportion of errors in our results is due to the difficulty of recognizing nested forms of disease names. For example, in all of the following cases: "hereditary forms of 'ovarian cancer'", "inherited 'breast cancer'", "male and female 'breast cancer'", parts of the phrase, such as ovarian cancer in hereditary forms of ovarian cancer and breast cancer in inherited breast cancer and male and female breast cancer, are disease names, and our models detect these very well. However, according to the annotation scheme, if a disease name is part of a nested disease name, the annotators considered the whole phrase as a single disease. So even though our model is able to detect part of the disease accurately, due to the exact matching scheme this counts as a false positive for us.

Related Research
In the biomedical domain, named entity recognition has attracted much attention for the identification of entities such as genes and proteins (Settles, 2005; Leaman and Gonzalez, 2008; Leaman et al., 2009), but not as much for disease name recognition. Notable works, such as that of Chowdhury and Lavelli (2010), are mainly conditional random field (CRF) based models using many manually designed template features. These include linguistic, orthographic, contextual and dictionary based features. However, they evaluated their model on the AZDC dataset, which is small compared to the NCBI dataset considered in this study. Nikfarjam et al. (2015) proposed a CRF based sequence tagging model, where the cluster id of the embedded word is used as an extra feature alongside manually engineered features for adverse drug reaction recognition in tweets.

1 We obtained pre-trained word-embedding features from (TH et al., 2015); in their pre-processing strategy, all words with frequency less than 50 were mapped to a rare-word token.
Recently, deep neural network models with minimal dependency on feature engineering have been used in a few NLP studies, including NER tasks (Collobert et al., 2011; Collobert and Weston, 2008). dos Santos et al. (2015) used a deep neural network based model, a window based network, to recognize named entities in Portuguese and Spanish texts. In that work, they exploit the power of a CNN to get morphological and shape features of words into a character-level word embedding, and use it as a feature concatenated with the word embedding. Their results indicate that CNNs are able to preserve morphological and shape features through character-level word embeddings. Our models are quite similar to this model, but we use different varieties of RNNs in place of the window based neural network. Labeau et al. (2015) used a Bi-RNN with character-level word embeddings as the only feature for PoS tagging of German text. Their results also show that with only character-level word embeddings one can get state-of-the-art results for PoS tagging of German text. Our model uses word embeddings as well as character-level word embeddings together as features, and we also try more sophisticated RNN models such as LSTM and GRU in a bidirectional structure. The more recent work of Huang et al. (2015) used LSTM and CRF in a variety of combinations, such as LSTM only, LSTM with CRF and Bi-LSTM with CRF, for PoS tagging, chunking and NER tasks on general texts. Their results show that Bi-LSTM with CRF gave the best results in all these tasks. These two works used either a Bi-RNN with character embedding features or a Bi-LSTM with word embedding features on general or newswire texts, whereas in this work we compare the performance of three different types of RNNs (Bi-RNN, Bi-GRU and Bi-LSTM) with both word embedding and character embedding features on biomedical text for disease name recognition.

Conclusions
In this work, we used three different variants of bidirectional RNN models with word embedding features for the first time for the disease name recognition and classification tasks. Bidirectional RNN models are used to capture both forward and backward long-term dependencies among words within a sentence. We have shown that these models are able to obtain quite competitive results compared to the benchmark result on the disease name recognition task. Further, our results show significantly improved performance on the relatively harder task of disease classification, which has not been studied much. All our results were obtained without putting any effort into feature engineering or requiring domain-specific knowledge. Our results also indicate that RNN based models perform better than the window based neural network model for the two tasks. This could be due to the implicit ability of RNN models to capture variable-range dependencies of words, compared to the explicit dependency on the context window size of window based neural network models.

Figure 1 :
Figure 1: Generic bidirectional recurrent neural network with sentence level log likelihood at the top-layer for sequence tagging task

Figure 2 :
Figure 2: CNN with max pooling for character-level embedding (p 1 and p 2 are padding). Here, the filter length is 3.
h_r^(t) is calculated similarly to h_l^(t) by reversing the words in the sentence. At the beginning, h_l^(0) and h_r^(0) are initialized randomly. Here n_I is the input size (d_we + d_ce), n_H is the hidden layer size and V ∈ R^{n_O×(n_H+n_H)} is the output layer parameter.

Table 2 :
Dataset statistics (spe. dis.: specific disease; comp. men.: composite mention). The AZDC dataset was annotated with disease mentions only and not with their categories. Statistics of the dataset are given in Table 2. In our evaluation we used this dataset in two settings: A, disease mention recognition, where all annotated categories are treated as a single disease class, and B, disease classification into the four categories.

Table 4 :
Performance of various models using 50-dimensional WE features. A: disease name recognition, B: disease classification task.

Results obtained on the validation dataset of task A are shown in Table 5. One can observe that the performance of all models has deteriorated.

Table 6 :
Results of different models with 50-dimensional random vectors on the task A validation set.

Comparison with the state-of-the-art results is shown in Table 7. BANNER is a CRF based bio-entity recognition model which uses general linguistic, orthographic and syntactic dependency features.

Table 7 :
Comparison of our best model results with the state-of-the-art results. SM-SVM: soft-margin support vector machine.