An Investigation of Recurrent Neural Architectures for Drug Name Recognition

Drug name recognition (DNR) is an essential step in the Pharmacovigilance (PV) pipeline. DNR aims to find drug name mentions in unstructured biomedical texts and classify them into predefined categories. State-of-the-art DNR approaches rely heavily on hand-crafted features and domain-specific resources which are difficult to collect and tune. For this reason, this paper investigates the effectiveness of contemporary recurrent neural architectures (the Elman and Jordan networks and the bidirectional LSTM with CRF decoding) at performing DNR straight from the text. The experimental results achieved on the authoritative SemEval-2013 Task 9.1 benchmarks show that the bidirectional LSTM-CRF ranks closely to highly-dedicated, hand-crafted systems.


Introduction
Pharmacovigilance (PV) is defined by the World Health Organization as the science and activities concerned with the detection, assessment, understanding and prevention of adverse effects of drugs or any other drug-related problems. Drug name recognition (DNR) is a fundamental step in the PV pipeline, analogous to the well-studied named entity recognition (NER) task in general natural language processing (NLP). DNR aims to find drug mentions in unstructured biomedical texts and classify them into predefined categories in order to link drug names with their effects and explore drug-drug interactions (DDIs).

Conventional approaches to DNR divide into rule-based, dictionary-based and machine learning-based systems. Rule-based systems are intrinsically hard to scale, time-consuming to assemble and ineffective in the presence of informal sentences and abbreviated phrases. Dictionary-based systems identify drug names by matching text chunks against drug dictionaries. These systems typically achieve high precision, but suffer from low recall (i.e., they miss a significant number of mentions) owing to spelling errors or drug name variants absent from the dictionaries (Liu et al., 2015a). Conversely, machine learning approaches have the potential to overcome these limitations since their statistical foundations make them intrinsically more robust to such variants. The current state-of-the-art machine learning approaches follow a two-step process of feature engineering and classification (Segura-Bedmar et al., 2015; Abacha et al., 2015; Rocktäschel et al., 2013). Feature engineering refers to the task of representing text by dedicated numeric vectors using domain knowledge. Like the design of rule-based systems, this task requires much expert knowledge, is typically challenging and time-consuming, and has a major impact on the final accuracy.

For this reason, this paper explores the performance of contemporary recurrent neural networks (RNNs) at providing end-to-end DNR straight from the text, without any manual feature engineering stage. The tested RNNs include the popular Elman and Jordan networks and the bidirectional long short-term memory (LSTM) with decoding provided by a conditional random field (CRF) (Elman, 1990; Jordan, 1986; Lample et al., 2016; Collobert et al., 2011). The experimental results over the SemEval-2013 Task 9.1 benchmarks show an interesting accuracy from the LSTM-CRF that exceeds that of various manually-engineered systems and approximates the best result in the literature.

Related Work
Most of the research on drug name recognition to date has focused on domain-dependent aspects and specialized text features. The benefit of leveraging such tailored features was made evident by the results of the SemEval-2013 Task 9.1 challenge (recognition and classification of pharmacological substances, known as the DNR task). The system that ranked first, WBI-NER (Rocktäschel et al., 2013), adopted very specialized features derived from an improved version of the ChemSpot tool (Rocktäschel et al., 2012), a collection of drug dictionaries and ontologies. Similarly, many other recent approaches (Abacha et al., 2015; Liu et al., 2015b; Segura-Bedmar et al., 2015) have been based on various combinations of general and domain-specific features.
In the broader field of machine learning, recent years have witnessed a rapid proliferation of deep neural networks, with unprecedented results in tasks as diverse as visual, speech and named entity recognition (Hinton et al., 2012; Krizhevsky et al., 2012; Lample et al., 2016). One of the main advantages of neural networks is that they can learn feature representations automatically from the data, thus avoiding the laborious feature engineering stage (Mesnil et al., 2015; Lample et al., 2016). Given these promising results, the main goal of this paper is to provide the first performance investigation of popular RNNs such as the Elman and Jordan networks and the bidirectional LSTM-CRF over DNR tasks.

The Proposed Approach
DNR can be formulated as a joint segmentation and classification task over a predefined set of classes. As an example, consider the input sentence provided in Table 1. The notation follows the widely adopted in/out/begin (IOB) entity representation with, in this instance, Cimetidine as the drug, ALFENTA as the brand, and the words volatile inhalation anesthetics together as the group. In this paper, we approach the DNR task with recurrent neural networks and therefore provide a brief description hereafter. In an RNN, each word in the input sentence is first mapped to a random real-valued vector of arbitrary dimension, d. Then, a measurement for the word, noted as x(t), is formed by concatenating the word's own vector with a window of preceding and following vectors (the "context"). An example of an input vector with a context window of size s = 3 is:

x(t) = w3(t) = [x_{word(t−1)}, x_{word(t)}, x_{word(t+1)}]    (1)

where w3(t) is the context window centered around the t-th word (here, 'reduces') and x_word represents the numerical vector of the corresponding word.
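As a concrete illustration of this input construction, here is a minimal NumPy sketch of equation (1); the sentence, embedding size and window size are illustrative rather than taken from the actual setup:

```python
import numpy as np

# Map each word to a random d-dimensional vector, then build the measurement
# x(t) by concatenating the vectors of the s words centred on position t.
rng = np.random.default_rng(0)
d, s = 4, 3                                        # toy embedding and window sizes
sentence = ["cimetidine", "reduces", "clearance"]  # hypothetical fragment
vocab = {w: rng.uniform(-1.0, 1.0, d) for w in sentence}
vocab["<PAD>"] = np.zeros(d)                       # pads window positions off the edges

def x(t, words, half=s // 2):
    """Concatenate the embeddings of the s words centred on position t."""
    window = [words[i] if 0 <= i < len(words) else "<PAD>"
              for i in range(t - half, t + half + 1)]
    return np.concatenate([vocab[w] for w in window])

print(x(1, sentence).shape)   # (12,) = s * d, the window centred on 'reduces'
```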
For the Elman network, both x(t) and the output of the hidden layer at time t − 1, h(t − 1), are input into the hidden layer at frame t. The recurrent connection from the past time frame enables a short-term memory, while the hidden-to-hidden neuron connections make the network Turing-complete. This architecture, common in RNNs, is suitable for the prediction of sequences. Formally, the hidden layer is described as:

h(t) = f(U x(t) + V h(t − 1))    (2)

where U and V are randomly-initialized weight matrices between the input and the hidden layer, and between the past and current hidden layers, respectively. Function f(·) is the sigmoid:

f(z) = 1 / (1 + e^{−z})    (3)

which adds non-linearity to the layer. Eventually, h(t) is input into the output layer and multiplied by the output weight matrix, W:

y(t) = g(W h(t))    (4)

The output is normalized by a multi-class logistic (softmax) function, g(·), to become a proper probability over the class set. The output dimensionality is therefore determined by the number of entity classes (i.e., 4 for the DNR task). The Jordan network is very similar to the Elman network, except that the feedback is sourced from the output layer rather than the previous hidden layer:

h(t) = f(U x(t) + V y(t − 1))    (5)

Although the Elman and Jordan networks can learn long-term dependencies, their exponential decay biases them toward the most recent inputs (Bengio et al., 1994).
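The following NumPy sketch traces one forward pass of the Elman recurrence in equations (2)-(4), with the Jordan variant of equation (5) noted in a comment; all sizes and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, H, n_out = 12, 25, 4      # input size (s*d), hidden nodes, entity classes

U = rng.uniform(-1.0, 1.0, (H, n_in))    # input-to-hidden weights
V = rng.uniform(-1.0, 1.0, (H, H))       # hidden-to-hidden (recurrent) weights
W = rng.uniform(-1.0, 1.0, (n_out, H))   # hidden-to-output weights

def f(z):                                # sigmoid non-linearity, eq. (3)
    return 1.0 / (1.0 + np.exp(-z))

def g(z):                                # multi-class logistic (softmax)
    e = np.exp(z - z.max())
    return e / e.sum()

def elman_forward(xs):
    """Run the Elman recurrence over input vectors x(1), ..., x(T)."""
    h = np.zeros(H)                      # h(0)
    ys = []
    for x_t in xs:
        h = f(U @ x_t + V @ h)           # eq. (2)
        ys.append(g(W @ h))              # eq. (4): probabilities over classes
    # A Jordan network would instead feed back the previous output:
    # h = f(U @ x_t + V' @ y(t-1)), as in eq. (5).
    return ys

probs = elman_forward([rng.uniform(-1, 1, n_in) for _ in range(5)])
print(probs[0].sum())                    # ~1.0: a proper probability
```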
The LSTM was designed to overcome this limitation by incorporating a gated memory cell to capture long-range dependencies within the data (Hochreiter and Schmidhuber, 1997). In the bidirectional LSTM, for any given sentence, the network computes both a left-to-right representation, →h(t), and a right-to-left representation, ←h(t), of the sentence context at every input, x(t). The final representation is created by concatenating them, h(t) = [→h(t); ←h(t)]. All these networks use the h(t) layer as an implicit feature for entity class prediction: although this model has proved effective in many cases, it is not able to jointly decode the output sequence in a Viterbi-style manner (e.g., ruling out an I-group tag immediately after a B-brand tag). Thus, a further modification of the bidirectional LSTM is the addition of a conditional random field (CRF) (Lafferty et al., 2001) as the output layer to provide optimal sequential decoding. The resulting network is commonly referred to as the bidirectional LSTM-CRF (Lample et al., 2016).
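To make the CRF decoding step concrete, below is a minimal Viterbi sketch over per-position emission scores and a tag-transition matrix; the tag set, random scores and the forbidden transition are illustrative, not the trained model's parameters:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence.
    emissions: (T, K) per-position tag scores;
    transitions: (K, K), transitions[i, j] = score of moving from tag i to tag j."""
    T, K = emissions.shape
    score = emissions[0].copy()             # best score of a path ending in each tag
    back = np.zeros((T, K), dtype=int)      # back-pointers
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t]   # (prev tag, next tag)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):           # follow back-pointers to recover the path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

tags = ["O", "B-drug", "B-brand", "B-group", "I-group"]
rng = np.random.default_rng(2)
trans = rng.normal(size=(5, 5))
trans[tags.index("B-brand"), tags.index("I-group")] = -1e4   # I-group cannot follow B-brand
print([tags[i] for i in viterbi(rng.normal(size=(6, 5)), trans)])
```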

Datasets
The DDIExtraction 2013 shared task challenge from SemEval-2013 Task 9.1 (Segura-Bedmar et al., 2013) has provided a benchmark corpus for DNR and DDI extraction. The corpus contains manually-annotated pharmacological substances and drug-drug interactions (DDIs) for a total of 18,502 pharmacological substances and 5,028 DDIs.

It collates two distinct datasets: DDI-DrugBank and DDI-MedLine. Table 2 summarizes the basic statistics of the training and test datasets used in our experiments. For proper comparison, we follow the same settings as Segura-Bedmar et al. (2015), using the training data of the DNR task together with the test data of the DDI task for the training and validation of DNR. We split this joint dataset into training and validation sets, with approximately 70% of the sentences used for training and the remainder for validation.
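A minimal sketch of this split, with placeholder lists standing in for the loaded corpora (names and sizes are illustrative):

```python
import random

# Pool the DNR-task training sentences with the DDI-task test sentences, then
# hold out roughly 30% of the sentences for validation. The two lists below
# are placeholders for the corpora loaded from the datasets.
dnr_train_sentences = [f"dnr_sentence_{i}" for i in range(700)]
ddi_test_sentences = [f"ddi_sentence_{i}" for i in range(300)]

random.seed(42)
pool = dnr_train_sentences + ddi_test_sentences
random.shuffle(pool)
cut = int(0.7 * len(pool))                # ~70% of sentences for training
train_set, valid_set = pool[:cut], pool[cut:]
print(len(train_set), len(valid_set))     # 700 300
```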

Evaluation Methodology
Our models have been blindly evaluated on unseen DNR test data using strict evaluation metrics: a predicted entity is counted as correct only if it matches a ground-truth entity exactly, in both boundary and class. To facilitate the replication of our experimental results, we have used a publicly-available library for the implementation (i.e., the Theano neural network toolkit (Bergstra et al., 2010)). The experiments have been run over a range of values for the hyper-parameters, using the validation set for model selection (Bergstra and Bengio, 2012). The hyper-parameters include the number of hidden-layer nodes, H ∈ {25, 50, 100}, the context window size, s ∈ {1, 3, 5}, and the embedding dimension, d; a sketch of this selection loop is given below.
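A minimal sketch of this validation-driven grid search, assuming a train_and_score placeholder in lieu of an actual training run (the candidate set for d is illustrative, as its range is not stated above):

```python
import itertools

# Hypothetical grid search over the hyper-parameters listed above. H and s use
# the candidate sets quoted in the text; the candidate set for d is assumed.
def train_and_score(H, s, d):
    """Placeholder: train a network with these hyper-parameters and
    return its validation-set F1 score."""
    return (H * s * d) % 97 / 100.0         # stand-in for a real training run

best_f1, best_cfg = -1.0, None
for H, s, d in itertools.product([25, 50, 100], [1, 3, 5], [100, 300, 500]):
    f1 = train_and_score(H, s, d)
    if f1 > best_f1:                        # retain the configuration performing
        best_f1, best_cfg = f1, (H, s, d)   # best on the validation set
print(best_cfg, best_f1)
```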
The embedding and initial weight matrices were all sampled from the uniform distribution within the range [−1, 1]. Early stopping was set at 100 epochs to mitigate over-fitting, and the model that gave the best performance on the validation set was retained. The accuracy is reported in terms of the micro-average F1 score computed with the CoNLL score function (Nadeau and Sekine, 2007).

Table 3 shows the performance comparison between the explored RNNs and state-of-the-art DNR systems. As an overall note, the RNNs have not reached the same accuracy as the top system, WBI-NER (Rocktäschel et al., 2013). However, the bidirectional LSTM-CRF has achieved the second-best score on DDI-DrugBank and the third-best on DDI-MedLine. These results are notable given that the RNNs perform DNR straight from the text rather than from manually-engineered features. Since the RNNs learn entirely from the data, their better performance on the DDI-DrugBank dataset is very likely due to its larger size; accordingly, it is reasonable to expect higher relative performance should larger corpora become available in the future. Table 4 also breaks down the results by entity class for the bidirectional LSTM-CRF. The low scores on the brand class for DDI-MedLine and on the drug_n class (i.e., active substances not approved for human use) for DDI-DrugBank are likely attributable to their very small sample sizes (Table 2). This issue is also shared by the state-of-the-art DNR systems.
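For concreteness, the strict criterion underlying all the scores above can be computed as in the following compact sketch, which converts IOB tag sequences into entity spans and micro-averages precision, recall and F1 (the tag sequences are made up, and the simplified reader stands in for the official CoNLL script):

```python
def entities(tags):
    """Extract (class, start, end) spans from an IOB tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):          # sentinel flushes the last span
        if start is not None and not tag.startswith("I-"):
            spans.add((tags[start][2:], start, i))  # close the open span
            start = None
        if tag.startswith("B-"):
            start = i
    return spans

gold = entities(["B-drug", "O", "B-group", "I-group", "O"])
pred = entities(["B-drug", "O", "B-group", "O", "O"])
tp = len(gold & pred)                    # strict match: same class AND boundaries
precision = tp / len(pred) if pred else 0.0
recall = tp / len(gold) if gold else 0.0
f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
print(precision, recall, round(f1, 2))   # 0.5 0.5 0.5
```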

Conclusion
This paper has investigated the effectiveness of recurrent neural architectures, namely the Elman and Jordan networks and the bidirectional LSTM-CRF, for drug name recognition. The most appealing feature of these architectures is their ability to provide end-to-end recognition straight from the text, sparing the effort of laborious feature construction. To the best of our knowledge, ours is the first paper to explore RNNs for entity recognition from pharmacological texts. The experimental results over the SemEval-2013 Task 9.1 benchmarks look promising, with the bidirectional LSTM-CRF ranking closely to the state of the art. A potential way to further improve its performance would be to initialize its training with unsupervised word embeddings such as Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). This approach has proved effective in many other domains while still dispensing with expert annotation effort; we plan to explore it in the near future.