An ensemble CNN method for biomedical entity normalization

Different representations of the same concept often appear in scientific reports and publications. Entity normalization (or entity linking) is the task of matching these varied representations to their standard concepts. In this paper, we present a two-step ensemble CNN method that normalizes microbiology-related entities in free text to concepts in standard dictionaries. The method is capable of linking entities when only a small microbiology-related biomedical corpus is available for training, and achieved reasonable performance in the online test of the BioNLP-OST19 shared task Bacteria Biotope.


Introduction
With over 500K papers in the biomedical field published on average every year, it is important to promote efficient automatic information retrieval and knowledge processing from the literature. Named entity recognition (NER), which extracts meaningful real-world objects from free text, and entity normalization (entity linking), which links ambiguous or varied extracted objects to standard concepts, are two fundamental natural language processing (NLP) tasks to approach this goal.
While many attempts have been made at general entity normalization (Hachey, Radford et al. 2013, Luo, Huang et al. 2015, Wu, He et al. 2018, Aguilar, Maharjan et al. 2019), biomedical entity linking faces additional challenges handling entity variations, making it an enthralling field to explore. Many studies endeavoring to solve biomedical entity normalization problems have been published (Hanisch, Fundel et al. 2005, Leaman and Lu 2016, Cho, Choi et al. 2017, Li, Chen et al. 2017, Luo, Song et al. 2018, Ji, Wei et al. 2019). Meanwhile, the BioNLP Shared Tasks, one of the community-wide challenges that aim to find solutions for biomedical literature information retrieval, also address diverse entity linking tasks (Bossy, Jourde et al. 2011, Bossy, Golik et al. 2013, Chaix, Dubreucq et al. 2016, Deléger, Bossy et al. 2016). However, further investigation is required to improve the performance of entity linking systems, especially when the available corpus is small.
Here, we present a two-step neural network-based ensemble method that links pre-annotated microbiology-related entities in free text to standard concepts using semantic information from pre-trained word vectors. By integrating a perfect match method with a shallow CNN, our model's performance is comparable to that of SOTA methods when trained with a small biomedical corpus (2258 microbiology-related entities, or 1248 after de-duplication, from 198 microbiology-related publications and reports) provided by the BioNLP-OST19 Bacteria Biotope challenge. We compared our ensemble model to both a baseline method, in which free text entities were linked to the standard concepts by vector distance (Manning, Raghavan et al. 2010), and ABCNN, one of the SOTA models that can be used for entity normalization (Yin, Schütze et al. 2016). In addition, the method was tested online, and the results indicated that our model achieves reasonable performance for microbiology-related entity linking tasks with small corpora.


Related work
Most early studies utilized morphological similarity, defined by edit distances between input terms and standard concepts, to normalize entities (Ristad and Yianilos 1998, Aronson 2001).
Later, heuristic rules were incorporated to improve performance. One study presented a joint NER and normalization system utilizing semi-Markov models, which has been adopted by an integrated bioconcept annotation and retrieval platform developed by the NIH (Wei, Allot et al. 2019). However, while many of these studies achieved good performance, they were limited in further improvement by the common drawbacks of rule-based methods. Approaches utilizing the semantic information of entities were made possible by the advent of word embedding techniques. Word embeddings project words into vector spaces, where the cosine similarities between vectors indicate their semantic similarities. The CONTES system (Ferré, Zweigenbaum et al. 2017) and the subsequent HONOR system (Ferré, Deléger et al. 2018) performed entity linking by minimizing the distances between embedded input terms and standard biomedical concepts. Karadeniz and Özgür (2019) proposed an unsupervised method for entity linking using word embeddings and a syntactic parser.
Meanwhile, neural networks have been combined with word embeddings to normalize biomedical entities. Limsopatham and Collier (2016) applied a convolutional neural network (CNN) and a recurrent neural network (RNN) to pre-trained word embeddings to normalize medical concepts in social media texts, and achieved SOTA performance on several datasets. Li et al. (2017) utilized a CNN structure to rank candidates generated by rule-based methods. Deep neural networks such as multi-view CNN and BERT have also been proposed to normalize biomedical entities (Luo, Song et al. 2018, Ji, Wei et al. 2019). However, their applications might be limited by the requirement of large amounts of data.

Models
Our model architecture is shown in Figure 1, where our major work is highlighted in blue and further discussed in Section 3.1-3.3.
To process the entities from the standard dictionary, let $e_i^d$ be the $i$-th entity from the dictionary, and $w_{ij} \in \mathbb{R}^k$ be the $k$-dimensional word vector of the $j$-th word in entity $e_i^d$. The embedded vector $v_i$ of entity $e_i^d$ is defined as the average of the word vectors of $e_i^d$: $v_i = \frac{1}{n_i} \sum_{j=1}^{n_i} w_{ij}$, where $n_i$ is the number of words of $e_i^d$ present in the VSM. The VSM was created from the biomedical scientific literature in the PubMed database and Wikipedia (Pyysalo 2013), with $k = 200$. PCA was conducted to increase training efficiency, where $r_{\mathrm{component}}$ is the reduction rate.
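As a minimal sketch of the dictionary-entity embedding step, the averaging above might look like the following, assuming the VSM has been loaded into a word-to-vector mapping (e.g. with gensim); the toy 4-dimensional vectors are illustrative stand-ins for the 200-dimensional Pyysalo vectors, and the PCA reduction is omitted here:

```python
import numpy as np

def embed_entity(entity, vsm, dim=200):
    """Average the word vectors of the entity's in-VSM words.

    `vsm` is assumed to map a word to a `dim`-dimensional numpy vector.
    Out-of-vocabulary words are skipped; an entity with no in-VSM words
    falls back to the zero vector.
    """
    vectors = [vsm[w] for w in entity.lower().split() if w in vsm]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# toy 4-dimensional VSM for illustration only
vsm = {"central": np.array([1., 0., 0., 0.]),
       "nervous": np.array([0., 1., 0., 0.]),
       "system":  np.array([0., 0., 1., 0.])}
v = embed_entity("central nervous system", vsm, dim=4)
```

A PCA step (e.g. scikit-learn's `PCA`) could then be fitted over all dictionary vectors to reduce their dimensionality by the chosen reduction rate.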
The processing of the entities from free text is described in detail in the following section. Notably, abbreviations are commonly seen in free text from publications and reports. For example, CNS, standing for the central nervous system, is often used in the neuroscience literature and would be annotated as an entity to be linked. However, abbreviations, mostly derived from phrases, are often absent from the pre-trained word vector spaces and would interfere with model training. To solve this problem, we first converted potential abbreviations in the free text pre-annotated entity list to their long forms with Ab3P, an abbreviation detection tool developed specifically for biomedical concepts. It reached 96.5% precision and 83.2% recall on 1250 randomly selected MEDLINE records, as reported by Sohn et al. (2008).
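Once Ab3P has produced its (short form, long form) pairs for a document, substituting them into the pre-annotated entities is straightforward. A hedged sketch, assuming the detected pairs are available as a plain dict (the `CNS` mapping below is just the example from the text):

```python
import re

def expand_abbreviations(entity, abbrev_map):
    """Replace Ab3P-detected short forms with their long forms.

    `abbrev_map` is assumed to hold the short->long pairs detected for
    the current document, e.g. {"CNS": "central nervous system"}.
    Whole tokens only: "CNS" inside a longer word is left untouched.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, abbrev_map)) + r")\b")
    return pattern.sub(lambda m: abbrev_map[m.group(0)], entity)

expanded = expand_abbreviations("CNS infection",
                                {"CNS": "central nervous system"})
```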
The converted free text pre-annotated entities were then matched character-by-character with dictionary-derived standard concepts through a perfect match module (Section 3.1). The entities that failed perfect matching were then fed to a set of shallow CNN models (Section 3.2) trained with bootstrap samples. Next, the outputs of the CNNs were mapped to standard entity vectors via cosine similarity. The standard entity vectors output from the voting classifier (Section 3.3) were predicted as the linked results of the input entities.

Perfect match
We noticed that some entities from the free text were able to match the standard entities character-by-character after rule-based processing. These entities were then directly linked to the dictionary instead of being fed to the Word2Vec and CNN models. The rules we designed include:
• Hyphens were replaced with spaces.
• Characters except alphabetic letters and spaces were removed.
• Case-insensitive string matching was performed between the free text entities and standard entities.
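The three rules above can be sketched directly in code; the dictionary format and concept IDs below are hypothetical placeholders, not the actual OntoBiotope entries:

```python
import re

def normalize_surface(text):
    """Apply the perfect match rules: hyphens -> spaces, keep only
    alphabetic letters and spaces, lowercase for case-insensitive
    comparison (repeated spaces collapsed)."""
    text = text.replace("-", " ")
    text = re.sub(r"[^A-Za-z ]", "", text)
    return " ".join(text.lower().split())

def perfect_match(entity, dictionary):
    """Return the ID of the first standard concept whose normalized
    surface form equals the normalized entity, or None on failure."""
    key = normalize_surface(entity)
    for concept_id, concept_name in dictionary.items():
        if normalize_surface(concept_name) == key:
            return concept_id
    return None

# hypothetical concept ID and dictionary entry for illustration
match = perfect_match("Gram-positive  bacteria",
                      {"OBT:000196": "gram positive bacteria"})
```

Entities for which `perfect_match` returns `None` proceed to the CNN stage.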

Shallow CNN
The shallow CNN (Figure 1) was adapted from the ideas of Kim (2014) and Limsopatham and Collier (2016).
To start with, let $e_i^f$ be the $i$-th input entity (provided by the task), and $w_{ij} \in \mathbb{R}^k$ be the $k$-dimensional word vector of the $j$-th word in entity $e_i^f$, with $k = 200$. The embedded matrix $M_i$ of entity $e_i^f$ is defined as $M_i = w_{i1} \oplus w_{i2} \oplus \dots \oplus w_{i n_i}$.
Here $n_i \in \mathbb{N}^+$ is the number of words in entity $e_i^f$ present in the pre-trained VSM (Pyysalo 2013), and $\oplus$ is the concatenation operator. $M_i$ is padded to length 8, as 98.8% of the input entities were composed of 8 or fewer words. For the entities with more than 8 words, average pooling was performed beforehand with pool size $(p, 200)$ and stride $p$, where $p = \lceil n_i / 8 \rceil$. In other words, a simple average of each group of $p$ neighboring words was calculated, so that the final embedded matrix always has length $\le 8$.
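A minimal numpy sketch of this length normalization, assuming the pool size is the ceiling of $n_i / 8$ as reconstructed above:

```python
import math
import numpy as np

def compress_to_max_len(matrix, max_len=8):
    """Average-pool neighboring word vectors so the sequence length
    never exceeds `max_len`, then zero-pad up to `max_len` rows."""
    n, dim = matrix.shape
    if n > max_len:
        p = math.ceil(n / max_len)  # pool size and stride
        matrix = np.stack([matrix[i:i + p].mean(axis=0)
                           for i in range(0, n, p)])
    padded = np.zeros((max_len, dim))
    padded[:matrix.shape[0]] = matrix
    return padded

# a 10-word entity with 2-dimensional toy vectors -> pooled and padded to 8 rows
m = compress_to_max_len(np.arange(20.).reshape(10, 2))
```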
A temporal convolution kernel followed by a max-over-time pooling operation and a fully connected layer were applied to each $M_i$. The output $v_i$ was then passed to a cosine similarity function to calculate the similarity scores between $v_i$ and each standard entity vector. The standard concept with the highest score was predicted as the linked entity $\hat{e}_i$.
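The forward pass can be sketched in plain numpy; the randomly initialized weights and the small filter/output sizes below are stand-ins for the trained parameters, not the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, kernel, filters, out_dim, max_len = 200, 4, 16, 200, 8

# randomly initialized stand-ins for trained weights
W_conv = rng.normal(size=(kernel, dim, filters)) * 0.01
W_fc = rng.normal(size=(filters, out_dim)) * 0.01

def forward(M):
    """Temporal convolution -> max-over-time pooling -> fully connected."""
    n = M.shape[0] - kernel + 1
    conv = np.stack([np.tensordot(M[t:t + kernel], W_conv,
                                  axes=([0, 1], [0, 1]))
                     for t in range(n)])   # (n, filters)
    pooled = conv.max(axis=0)              # max over time: (filters,)
    return pooled @ W_fc                   # entity vector: (out_dim,)

def link(M, standard_vectors):
    """Predict the index of the standard concept with the highest
    cosine similarity to the network output."""
    v = forward(M)
    sims = standard_vectors @ v / (
        np.linalg.norm(standard_vectors, axis=1) * np.linalg.norm(v) + 1e-12)
    return int(np.argmax(sims))

M = rng.normal(size=(max_len, dim))          # one padded input entity
standards = rng.normal(size=(10, out_dim))   # 10 toy dictionary vectors
pred = link(M, standards)
```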

Ensemble mechanism with voting
To reduce overfitting, we designed an ensemble method that combined 5 shallow CNNs in the spirit of ensemble learning (Valiant 1984). The 5 CNNs shared an identical architecture, but their weights were initialized randomly and independently. To increase the generalization capability of our model, the CNNs were fed with training data randomly subsampled with the bootstrap method (Efron 1982), with the out-of-bag samples used for cross-validation. The final normalized results were obtained with a majority-vote classifier over the outputs of the 5 shallow CNNs. If no majority output was present, the output from the network with the best cross-validation estimate was chosen.
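The voting rule might be sketched as follows; "majority" is interpreted here as a strict majority (at least 3 of 5), which is one plausible reading of the text:

```python
from collections import Counter

def ensemble_vote(predictions, cv_scores):
    """Majority vote over the per-network predicted concept IDs.

    Falls back to the network with the best out-of-bag
    cross-validation score when no strict majority exists.
    """
    counts = Counter(predictions)
    top, freq = counts.most_common(1)[0]
    if freq > len(predictions) // 2:   # strict majority, e.g. >= 3 of 5
        return top
    best = max(range(len(predictions)), key=lambda i: cv_scores[i])
    return predictions[best]

# toy predictions and cross-validation scores from 5 networks
winner = ensemble_vote(["child", "child", "child", "patient", "cell"],
                       [0.61, 0.58, 0.64, 0.60, 0.59])
```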

Baseline model
For each entity $e_i^d$ in the standard dictionary and $e_i^f$ in free text, the corresponding embedded vector was computed as the average of its word vectors, $v_i = \frac{1}{n_i} \sum_{j=1}^{n_i} w_{ij}$, where $n_i \in \mathbb{N}^+$ is the number of words in the $i$-th entity present in the pre-trained VSM.
Cosine similarity between each free text-dictionary entity pair was calculated. The free text entity was linked to the dictionary entity with the highest similarity score.
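The baseline linking step reduces to one argmax over a cosine similarity matrix; a sketch with toy 2-dimensional vectors in place of the averaged 200-dimensional embeddings:

```python
import numpy as np

def cosine_link(free_vecs, dict_vecs):
    """Link each free-text entity vector (row of `free_vecs`) to the
    dictionary entity with the highest cosine similarity, returning
    row indices into `dict_vecs`."""
    a = free_vecs / (np.linalg.norm(free_vecs, axis=1, keepdims=True) + 1e-12)
    b = dict_vecs / (np.linalg.norm(dict_vecs, axis=1, keepdims=True) + 1e-12)
    return np.argmax(a @ b.T, axis=1)

# toy 2-dimensional embeddings for two free-text and three dictionary entities
free_vecs = np.array([[1., 0.], [0., 1.]])
dict_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [-1., 0.]])
links = cosine_link(free_vecs, dict_vecs)
```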

ABCNN
ABCNN (Yin, Schütze et al. 2016) is a state-of-the-art deep learning model for text similarity learning, which can also be applied to entity linking tasks. The model introduces an attention mechanism into a pair of weight-sharing CNNs based on a siamese architecture (Bromley, Guyon et al. 1994).
For our purpose, we used a slight variant of a published ABCNN model. According to the original publication, the attention mechanism can be applied to different layers of the CNN pair. Considering the data volume and the model complexity, we applied the attention mechanism to the input layer.

Data and resources
The biomedical corpus and pre-annotated entities were provided by the BioNLP-OST19 task Bacteria Biotope. Table 1 shows the detailed statistics of the data provided by the task. Two types of entities were involved in the task: phenotype, which describes microbial characteristics, and habitat, which describes physical places where microorganisms can be observed. A dictionary with 3602 standard concepts was also provided by the task. In the original dictionary, each concept is assigned a unique ID, and the hierarchical information of its direct parents is also listed. In our model, the hierarchical information is omitted.
Ab3P-detected abbreviations were provided as separate input files by the task organizers.
The 4 GB word vector space model was downloaded in binary format and extracted with the Python package gensim.

Training
Our CNN model was trained using a stochastic gradient descent optimizer with cosine proximity as the loss function. We randomly split off 20% of the samples as a validation dataset for each CNN and used an early stopping criterion to determine the number of training epochs. The learning rate was fixed at 0.01. The batch size (2), kernel size (4) and filter number (5000) were determined by grid search.
As expected with this small volume of data, extra convolution layers led to overfitting.

Held-out evaluation
We used precision, the official metric of the challenge, to evaluate the performance of our model and the reference models on the held-out development dataset. Our ensemble CNN model performed around 3 times better than both reference models, with an average precision score of 0.622, indicating the effectiveness of our model. However, it should be noted that the perfect match module in our system had a remarkably higher precision score than the shallow CNN module, suggesting that the performance of the neural network could be further improved.
We analyzed the results of the shallow CNN module and concluded that 3 possible reasons might be associated with its performance: 1) Missing context. For example, our model normalized "children" to "child", while the provided label was "patient with infectious disease" in articles describing children with infectious diseases. 2) Missing hierarchical information. For example, our model normalized "B cell" to "cell" instead of "lymphocyte", the latter being a more accurate description. Tackling these two issues would require either the context or the hierarchical information of the standard concepts to be considered in the system. 3) Wrong match. For example, "cats" was normalized to "dog", suggesting that the networks were not trained well enough to normalize these words. However, we noticed that such errors mostly came with a majority vote of 2 or 1, which in turn demonstrates the power of the voting mechanism.

Online test
The ensemble CNN model was then evaluated through online testing. Our results showed 12.5% and 17.7% precision increases in the habitat and phenotype entity linking tasks respectively, compared to the challenge-provided baseline model (Table 3), in which case-insensitive string matching was applied for linking. In addition, our model performed best, or among the best, compared to the models proposed by other participants, suggesting its advantages. We did not test our own reference models online due to the limited number of submissions allowed by the challenge.

Conclusions and Future direction
We introduced a two-step neural network-based ensemble method that links microbiology-related biomedical entities extracted from free text to standard concepts. The shallow architecture and ensemble mechanism on top of a perfect-match morphological similarity method achieved reasonable predictions with limited training samples. The comparison with reference models suggested the effectiveness of our model. In addition, our approach could be applied to other scenarios where semantic linking between entities is required.
Further improvement might be achieved once more semantic clues are incorporated, as we briefly discussed at the end of Section 4.3. The normalization deviation due to missing context clues affected not only the performance of the shallow CNN but also that of the perfect match module. For example, though the entity 'cell' has a perfect match in the standard dictionary, it might refer specifically to 'lymphocyte' in a research paper discussing immunity. While some efforts have been made to preserve hierarchical information between concepts during entity linking (Ferré, Deléger et al. 2018), it would be interesting to investigate whether knowledge graphs derived from the standard dictionaries and the input corpus could contribute to semantic-based entity normalization.
In addition, our model assigned the same weight to all the words present in the VSM, which might compromise the performance of the system. For example, only the word "children" is informative in the entity "children less than five years of age", as the entity is normalized to "child". The presence of the other words might interfere with the normalization. Regarding this issue, syntactic parsers might be adopted to improve performance.