A Neural Pipeline Approach for the PharmaCoNER Shared Task using Contextual Exhaustive Models

We present a neural pipeline approach that performs named entity recognition (NER) and concept indexing (CI), which links the recognized mentions to concept unique identifiers (CUIs) in a knowledge base, for the PharmaCoNER shared task on pharmaceutical drugs and chemical entities. We propose a neural NER model that captures the surrounding semantic information of a given sequence by using the forward- and backward-context of the bidirectional LSTM (Bi-LSTM) output around a target span, following a contextual span representation-based exhaustive approach. The NER model enumerates all possible spans as potential entity mentions and classifies them into entity types or non-entity with deep neural networks. To represent spans, we compare several neural network architectures and their ensemble for the NER model. We then perform dictionary matching for CI and, if there is no match, compute similarity scores between a mention and CUIs using entity embeddings and assign the CUI with the highest score to the mention. We evaluate our approach on the two sub-tasks of the shared task. Among the five submitted runs, the best run for each sub-task achieved an F-score of 86.76% on Sub-task 1 (NER) and an F-score of 79.97% (strict) on Sub-task 2 (CI).


Introduction
The PharmaCoNER (Gonzalez-Agirre et al., 2019) shared task¹ is an open challenge that allows participants to use any methodology and knowledge sources for the clinical records with protected health information. The task comprises two sub-tasks in the pharmaceutical drug and clinical domain: named entity recognition (NER), which is officially called NER offset and entity classification, and concept indexing (CI). Of these sub-tasks, we focus on NER, since NER has drawn considerable attention as the first step towards many natural language processing (NLP) applications including relation extraction (Miwa and Bansal, 2016), event extraction (Feng et al., 2016), and coreference resolution (Fragkou, 2017). Recently, deep neural networks have shown impressive performance on named entity recognition in several domains (e.g., Lample et al. (2016)). Such models achieved state-of-the-art results without requiring any hand-crafted features or external knowledge resources.
¹ https://2019.bionlp-ost.org/tasks
In this paper, we present a pipeline approach that addresses both NER and CI. We mainly focus on NER and employ a neural exhaustive model (Sohrab and Miwa, 2018; Sohrab et al., 2019) for NER. The model detects flat and nested entities by reasoning over all the spans within a specified maximum size. Unlike existing models that rely on token-level labels, our model directly uses an entity type as the label of a span. Each span is represented as the combination of boundary and inside representations computed from the outputs of a bidirectional long short-term memory (Bi-LSTM) network. We employ and compare different span representations following Sohrab and Miwa (2018) and Sohrab et al. (2019), which leads us to propose new contextual exhaustive models. The original model (Sohrab and Miwa, 2018) treated all the tokens in a span equally: it averaged the LSTM outputs of the tokens inside the span to obtain the inside representation and concatenated it with the boundary representation, ignoring the context of the span entirely. Sohrab et al. (2019) proposed several extensions, including contextual span representations and several different inside representations. In that approach, the contextual span representation captures only the LSTM outputs at the time steps immediately before and after the target span; the surrounding context running from the beginning of the sequence to the target span (forward-context) and from the end of the sequence to the target span (backward-context) is ignored. Unlike these previous methods (Sohrab and Miwa, 2018; Sohrab et al., 2019), the proposed contextual exhaustive approach captures the surrounding context of a given sequence through the forward- and backward-context of the Bi-LSTM output around a target span; we describe the details in Section 3.1.3. In addition, the contextual exhaustive approach is extended to leverage the output of a morphological analyser. The spans with their representations are classified into entity types or non-entity.
Given the mentions predicted by the NER module, we map them to a knowledge base (KB), i.e., SNOMED-CT, by direct dictionary matching and by similarity scores between mentions and the names of their candidate CUI terms. The best run for each sub-task achieved an F-score of 86.76% on Sub-task 1 (NER) and an F-score of 79.97% on Sub-task 2 (CI).

Related Work
Most NER work focuses on flat entities. Lample et al. (2016) proposed an LSTM-CRF (conditional random field) model, which has been widely used and extended for flat NER, e.g., by Akbik et al. (2018). In recent studies of neural network-based flat NER, Gungor et al. (2018, 2019) have shown that morphological analysis, which adds word representations based on linguistic properties of the words, improves NER performance over using only representations based on the surface forms of words, especially for morphologically rich languages such as Turkish and Finnish.
Recently, nested NER has attracted wide interest in NLP. Zhou et al. (2004) detected nested entities in a bottom-up way. They detected the innermost flat entities and then found other NEs containing the flat entities as sub-strings using rules on the detected entities. The authors reported an improvement of around 3% in F-score under certain conditions on the GENIA data set (Collier et al., 1999). Recent studies show that conditional random fields (CRFs) can produce significantly higher tagging accuracy in flat or nested (stacking flat NER to nested representation) NER (Son and Minh, 2017). Ju et al. (2018) proposed a neural model that addresses nested entities by dynamically stacking flat NER layers until no outer entities are extracted. A cascaded CRF layer is used after the LSTM output in each flat layer. The authors reported that the model outperformed the previous state-of-the-art results, achieving an F-score of 74.5% on the GENIA data set. Sohrab and Miwa (2018) proposed a neural model that detects nested entities using an exhaustive approach and outperformed the previous state-of-the-art results, achieving an F-score of 77.1% on the GENIA data set. Sohrab et al. (2019) further extended the span representations for entity recognition and addressed sensitive span detection in the MEDDOCAN (MEDical DOCument ANonymization) shared task, where the system achieved F-scores of 93.12% and 93.52% for NER and sensitive span detection, respectively.

Pipeline Approach for NER and Concept Indexing
The pipeline approach consists of two modules:
• Named entity recognition, which uses a contextual neural exhaustive approach.
• Concept indexing (CI), which generates the list of unique SNOMED concept identifiers of the mentions detected by the NER module for each document.

Neural Named Entity Recognition
We solve the NER task by building on a neural exhaustive model (Sohrab and Miwa, 2018; Sohrab et al., 2019) and extending it into a new contextual exhaustive approach, which exhaustively considers all possible contextual spans in a sentence using a single neural network. The model detects nested entities by enumerating all possible contextual spans. It is built upon a shared bidirectional LSTM (Bi-LSTM) layer, and we consider several different representations for the contextual span using the outputs of the Bi-LSTM. Figure 1 shows the contextual exhaustive model for detecting possible mentions. The proposed model consists of embedding, bidirectional LSTM, and exhaustive layers. We explain each layer in the following subsections.

Embedding Layer
In the embedding layer, each word is represented by concatenating its pre-trained word embedding and a character-based word representation that encodes the character-level information of the word. The character-based word representation is obtained by feeding the sequence of character embeddings of the word to a Bi-LSTM and concatenating the forward and backward output representations. In addition, we use a morphological analyzer³ to generate morphological tags, where the tag for each input word is formed by merging the lemma and part-of-speech tag of the word. Each tag produced by the morphological analyzer is then treated as a sequence of characters and encoded with randomly initialized character embeddings. Specifically, we feed the sequence to a separate Bi-LSTM and concatenate the forward and backward outputs to obtain the morphological representation of a word.

Bidirectional LSTM Layer
Given an input sentence sequence $X = \{x_1, \ldots, x_n\}$, where $x_i$ denotes the i-th word and n denotes the number of words in the sentence, the distributed embeddings of the words from the embedding layer are fed into a Bi-LSTM layer. The Bi-LSTM layer computes the hidden vector sequence in the forward ($\overrightarrow{h} = \{\overrightarrow{h}_1, \ldots, \overrightarrow{h}_n\}$) and backward ($\overleftarrow{h} = \{\overleftarrow{h}_1, \ldots, \overleftarrow{h}_n\}$) directions. We concatenate the forward and backward outputs as $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, where $[;]$ denotes concatenation.
³ https://github.com/PlanTL-SANIDAD/SPACCC_POS-TAGGER

Exhaustive Layer
The exhaustive layer enumerates all possible spans by exhaustive combination. We generate all spans whose size is at most the maximum span size L, a predefined hyper-parameter. We use (i, k) to denote the span from word i to word k inclusive, where 1 ≤ i ≤ k ≤ n and k − i < L. We represent each span using the outputs of the shared underlying LSTM layer, in the different ways explained later. We then feed the representation of each span to a rectified linear unit (ReLU) activation function. Finally, the output of the activation layer is passed to a softmax output layer to classify the span into a specific entity type or non-entity.
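The span enumeration described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the "O" label for non-entity spans, and the inclusion of single-token spans (i = k) are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]


def enumerate_spans(n: int, max_size: int) -> List[Span]:
    """Return all 1-indexed inclusive spans (i, k) with k - i < max_size."""
    spans = []
    for i in range(1, n + 1):
        # The end index is bounded by the sentence length and the span size L.
        for k in range(i, min(i + max_size - 1, n) + 1):
            spans.append((i, k))
    return spans


def label_spans(spans: List[Span], gold: Dict[Span, str]) -> List[str]:
    """Give each span its gold entity type, or "O" (assumed name) for non-entity."""
    return [gold.get(span, "O") for span in spans]


# Toy sentence with 4 words and maximum span size L = 2.
spans = enumerate_spans(n=4, max_size=2)
labels = label_spans(spans, {(2, 3): "PROTEINAS"})
print(list(zip(spans, labels)))
```

Because every span receives its own label, overlapping or nested mentions need no special treatment, and no token-level tagging scheme is involved.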
In the rest of this section, we introduce the span representations and their enhancements.
Contextual Span Representations with Averaging
For contextual span representations (Sohrab et al., 2019), we represent the span with three separate representations: the surrounding context representation, the boundary representation for span detection, and the inside representation for semantic type classification. We capture the context representation of a given sequence from the Bi-LSTM outputs. Specifically, we obtain the forward- and backward-context of a target span (i, k) from the forward output $\overrightarrow{h}_{i-1}$ at the position preceding the span and the backward output $\overleftarrow{h}_{k+1}$ at the position following the span, i.e., the previous time step in each direction. The boundary representation captures both ends of the span; for this, we rely on the outputs of the Bi-LSTM layer corresponding to the boundary words of the target span. The inside representation captures the semantic type of the span by encoding its whole semantic information; we use the average of all the Bi-LSTM outputs corresponding to the words in the span. Combining the contextual, boundary, and inside representations, we define the representation $R(i,k)_{[F,L,A,R,B]}$ (Forward-context, Left-boundary, inside with Average, Right-boundary, and Backward-context) of the span (i, k) as follows:
$$R(i,k)_{[F,L,A,R,B]} = \left[\overrightarrow{h}_{i-1};\ h_i;\ \frac{1}{k-i+1}\sum_{j=i}^{k} h_j;\ h_k;\ \overleftarrow{h}_{k+1}\right] \quad (1)$$

Contextual Span Representations using Attention
We also try an attention mechanism (Bahdanau et al., 2015) instead of the average over the words in each span. Specifically, we replace the inside representation with an attention-weighted sum $\hat{x}_i$ of the word vectors in span (i, k), computed over $\overleftrightarrow{x}_t$, the concatenated output of the Bi-LSTM layer at each position t of the span (Equations (2)-(4)). Instead of Equation (1), we obtain the representation $R(i,k)_{[F,L,A,R,B]}$ (here A denotes the inside with Attention-based representation) of the span (i, k) as follows:
$$R(i,k)_{[F,L,A,R,B]} = \left[\overrightarrow{h}_{i-1};\ h_i;\ \hat{x}_i;\ h_k;\ \overleftarrow{h}_{k+1}\right] \quad (5)$$

Contextual LSTM-Minus-based Span Representations
We also try LSTM-Minus (Wang and Chang, 2016) for the boundary representation. The left boundary is computed by subtracting the representation of the word preceding the span from the representation of the last word of the span. Similarly, the right boundary is computed by subtracting the representation of the word following the span from the representation of the first word of the span. The forward- and backward-context of the target span are computed in the same manner as for Equation (1). We obtain the representation $R(i,k)_{[F,L,A,R,B]}$ (here L and R denote the Left- and Right-boundary based on LSTM-Minus, respectively) of the span (i, k) as follows:
$$R(i,k)_{[F,L,A,R,B]} = \left[\overrightarrow{h}_{i-1};\ h_k - h_{i-1};\ \frac{1}{k-i+1}\sum_{j=i}^{k} h_j;\ h_i - h_{k+1};\ \overleftarrow{h}_{k+1}\right] \quad (6)$$
Furthermore, the LSTM-Minus-based representation using attention is:
$$R(i,k)_{[F,L,A,R,B]} = \left[\overrightarrow{h}_{i-1};\ h_k - h_{i-1};\ \hat{x}_i;\ h_i - h_{k+1};\ \overleftarrow{h}_{k+1}\right] \quad (7)$$

Base Span Representations
We further consider representations without the context representation (Sohrab and Miwa, 2018), which we call base span representations. We generate them by removing the forward- and backward-context from Equations (1) and (5)-(7), which can then be rewritten respectively as:
$$R(i,k)_{[L,A,R]} = \left[h_i;\ \frac{1}{k-i+1}\sum_{j=i}^{k} h_j;\ h_k\right] \quad (8)$$
$$R(i,k)_{[L,A,R]} = \left[h_i;\ \hat{x}_i;\ h_k\right] \quad (9)$$
$$R(i,k)_{[L,A,R]} = \left[h_k - h_{i-1};\ \frac{1}{k-i+1}\sum_{j=i}^{k} h_j;\ h_i - h_{k+1}\right] \quad (10)$$
$$R(i,k)_{[L,A,R]} = \left[h_k - h_{i-1};\ \hat{x}_i;\ h_i - h_{k+1}\right] \quad (11)$$
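To make the span representations of this section concrete, the following NumPy sketch builds the contextual and base variants from given forward and backward LSTM outputs. It is an illustration under stated assumptions rather than the authors' code: the function name is hypothetical, the backward context is read at position k + 1, positions 0 and n + 1 are zero-padded, and the attention scores come from a single hypothetical weight vector `attn_w` because the exact scoring function is not recoverable from the text.

```python
import numpy as np


def span_representation(fwd, bwd, i, k, lstm_minus=False, attn_w=None, context=True):
    """Build R(i, k) for a 1-indexed inclusive span.

    fwd, bwd: arrays of shape (n + 2, d) with forward/backward LSTM outputs;
    rows 0 and n + 1 are zero padding (an assumption for sentence edges).
    """
    h = np.concatenate([fwd, bwd], axis=1)       # h_t = [fwd_t; bwd_t]
    inside = h[i:k + 1]                          # outputs for words i..k
    if attn_w is None:
        inner = inside.mean(axis=0)              # averaged inside representation
    else:
        scores = inside @ attn_w                 # hypothetical linear scoring
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over span positions
        inner = weights @ inside                 # attention-weighted sum
    if lstm_minus:
        left, right = h[k] - h[i - 1], h[i] - h[k + 1]
    else:
        left, right = h[i], h[k]
    parts = [left, inner, right]
    if context:
        # Forward context before the span, backward context after it.
        parts = [fwd[i - 1]] + parts + [bwd[k + 1]]
    return np.concatenate(parts)


# Toy example: n = 3 words, d = 2 dimensions per direction.
n, d = 3, 2
fwd = np.zeros((n + 2, d))
bwd = np.zeros((n + 2, d))
fwd[1:n + 1] = np.arange(1, 7).reshape(n, d)
bwd[1:n + 1] = -np.arange(1, 7).reshape(n, d)
print(span_representation(fwd, bwd, 2, 3).shape)                  # contextual
print(span_representation(fwd, bwd, 2, 3, context=False).shape)   # base
```

With d dimensions per direction, the contextual variants have 2d more dimensions than the base variants, since each context vector comes from a single direction.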

Concept Indexing
Concept indexing (CI) requires identifying a concept unique identifier (CUI) for every mention span of a concept in a document. The SNOMED-CT knowledge base is used to extract all candidate CUIs and their term names. For CI, the input is the set of predicted mention spans $M = \{m_1, m_2, \ldots, m_T\}$, where $m_i$ denotes the i-th mention and T denotes the total number of predicted mentions. Each mention is represented as a word sequence $m_i = \{w_1, \ldots, w_k\}$. Each CUI c is an entry in a knowledge base (KB), i.e., SNOMED-CT. For the CI task, the list of entity mentions $\{m_i\}_{i=1,\ldots,T}$ needs to be mapped to a list of corresponding CUIs $\{c_i\}_{i=1,\ldots,T}$. Using the SNOMED-CT database, we first look up each mention $m_i$ in the dictionary of CUI term names to retrieve a matching CUI. If no CUI is found for a mention, we compute similarity scores as the dot product between the mention and entity embeddings, which are expected to capture related CUIs, and select the CUI with the maximum score as the prediction for the mention.
We use fixed, continuous, task-specific entity embeddings, namely entity embeddings pre-trained with GloVe (Pennington et al., 2014) on the term names of all CUIs extracted from the Spanish SNOMED-CT KB. For a multi-token CUI term name, we simply average the embeddings of its tokens.
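The two-stage lookup described above can be sketched as follows. This is a minimal illustration with toy data: the function names, the lowercasing, the whitespace tokenization, the two-dimensional vectors, and the identifiers such as "CUI-A" are all hypothetical and do not come from SNOMED-CT or from the authors' system.

```python
import numpy as np


def embed(term, vectors, dim):
    """Average the word vectors of a (possibly multi-token) term."""
    rows = [vectors[w] for w in term.lower().split() if w in vectors]
    return np.mean(rows, axis=0) if rows else np.zeros(dim)


def link_mention(mention, term_to_cui, vectors, dim):
    """Return a CUI by exact dictionary match, else by highest dot product."""
    key = mention.lower()
    if key in term_to_cui:                        # stage 1: dictionary matching
        return term_to_cui[key]
    m = embed(mention, vectors, dim)              # stage 2: embedding similarity
    best = max(term_to_cui, key=lambda t: float(m @ embed(t, vectors, dim)))
    return term_to_cui[best]


# Toy dictionary and embeddings (all values are made up for the example).
term_to_cui = {"paracetamol": "CUI-A", "acido folico": "CUI-B", "cloruro sodio": "CUI-C"}
vectors = {
    "paracetamol": np.array([1.0, -1.0]),
    "acido": np.array([1.0, 0.0]),
    "folico": np.array([0.0, 1.0]),
    "cloruro": np.array([0.0, -1.0]),
    "sodio": np.array([-1.0, 0.0]),
}
print(link_mention("Paracetamol", term_to_cui, vectors, dim=2))
print(link_mention("acido folico oral", term_to_cui, vectors, dim=2))
```

In this sketch the fallback always returns some CUI, even when the best score is low; whether a minimum score threshold should apply is a design choice the text does not specify.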

Experimental Settings
We provide empirical evidence for the effectiveness of the pipeline architecture on both NER and concept indexing using the PharmaCoNER⁵ task of BioNLP-OST 2019⁶. The PharmaCoNER corpus, which has four entity types⁷, is randomly split into three subsets: training, development, and test sets, which contain 500, 250, and 250 clinical cases, respectively.
⁵ http://temu.bsc.es/pharmaconer/
Our model is implemented in the Chainer⁸ deep learning framework. We used the official PharmaCoNER evaluation script⁹ to evaluate our system's performance on both tasks.

Data Pre-processing
Each text and its corresponding annotation file were processed with a few simple rules, used only for tokenization.¹⁰ After tokenization, each text and its mapped annotation file were passed directly to the deep neural model for mention detection and classification. Note that the offsets were restored to the original offsets for evaluation.

Hyper-parameters
Word representations. We generated task-specific word embeddings for the Spanish PharmaCoNER corpus by training GloVe (Pennington et al., 2014) on the merged raw text of the training, development, and test (including the background set) sets. We set the dimension of word embeddings to 200, the dimension of character embeddings for character encoding to 25, and the dimension of character embeddings for morphological analysis to 25.
Hidden dimensions. The hidden states in the LSTMs had 200 dimensions. Each feed-forward neural network consisted of two hidden layers with 150 dimensions.
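As an illustration of how these dimensions fit the span classification step of the exhaustive layer, the sketch below passes span representations through two ReLU hidden layers of 150 units and a softmax output. It is a sketch under assumptions, not the authors' Chainer code: the function names, the random initialization, the input size of 1600, and the five output classes (four entity types plus non-entity) are chosen for the example.

```python
import numpy as np


def init_classifier(input_dim, hidden_dim=150, n_classes=5, seed=0):
    """Create weights for two hidden layers and one output layer."""
    rng = np.random.default_rng(seed)
    dims = [input_dim, hidden_dim, hidden_dim, n_classes]
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]


def classify(span_reps, params):
    """Return class probabilities of shape (num_spans, n_classes)."""
    x = span_reps
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)            # ReLU hidden layer
    W, b = params[-1]
    logits = x @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)       # softmax


# Three spans with an assumed representation size of 1600.
params = init_classifier(input_dim=1600)
probs = classify(np.ones((3, 1600)), params)
print(probs.shape, probs.argmax(axis=1))
```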
Learning. We chose Adam (Kingma and Ba, 2015) as the optimization algorithm with a mini-batch size of 10. We used the same hyper-parameters in all the experiments; we set the gra[...] (Kingma and Ba, 2015). The model was trained for up to 10 epochs, with early stopping based on the performance on the development set.
⁶ https://2019.bionlp-ost.org/
⁷ NORMALIZABLES: mentions of chemicals that can be manually normalized to a unique concept identifier; NO NORMALIZABLES: mentions of chemicals that could not be normalized manually to a unique concept identifier; PROTEINAS: mentions of proteins, genes, peptides, peptide hormones and antibodies; UNCLEAR: cases of general substance class mentions of clinical and biomedical relevance.
⁸ https://chainer.org/
⁹ https://github.com/PlanTL-SANIDAD/PharmaCoNER-Evaluation-Script
¹⁰ Unlike the traditional NER models, our model is independent of the traditional 'BIO' tagging scheme, where 'B', 'I', and 'O' stand for 'Begin', 'Inside', and 'Outside' of named entities, respectively, so we do not need to assign such tags to the tokens.

Results and Discussions
To evaluate the performance of NER and concept indexing, we conducted experiments with different span representations: contextual span representation (CSR) with averaging (CSR-Avg), CSR using attention (CSR-Attn), contextual LSTM-Minus-based span representation (CLM) with averaging (CLM-Avg), and CLM using attention (CLM-Attn). For the base span representations (BSR), we also consider BSR with averaging (BSR-Avg), BSR using attention (BSR-Attn), base LSTM-Minus-based span representation (BLM) with averaging (BLM-Avg), and BLM using attention (BLM-Attn). We also report the result of ensemble learning, which combines the predictions obtained with different span representations to reduce the variance of predictions and the generalization error. Table 1 shows the five submitted results for NER and CI in terms of F-score on the test set. The five submitted runs were chosen based on their development scores. The table shows that the ensemble approach, which applies maximum voting over all the approaches, improves the system performance on both the NER and CI tasks, achieving an F-score of 86.67% on NER. Among the individual span representations, CSR-Attn performs best on NER, with an F-score of 86.34%.
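A simple way to picture the maximum-voting ensemble is a per-span majority vote over the models' predictions. The sketch below is an assumption about how such voting could work, not a description of the submitted system: the function name, the "O" label for a model that predicts no entity for a span, and the tie-breaking behaviour (the first label counted wins) are all illustrative choices.

```python
from collections import Counter


def vote(predictions, none_label="O"):
    """Merge span predictions from several models by majority vote.

    predictions: list of dicts mapping (start, end) spans to entity types;
    a span missing from a model counts as a vote for `none_label`.
    """
    spans = set().union(*(p.keys() for p in predictions))
    merged = {}
    for span in sorted(spans):
        votes = Counter(p.get(span, none_label) for p in predictions)
        label, _ = votes.most_common(1)[0]
        if label != none_label:                  # keep only entity decisions
            merged[span] = label
    return merged


# Toy predictions from three hypothetical models.
model_a = {(0, 1): "PROTEINAS", (3, 3): "NORMALIZABLES"}
model_b = {(0, 1): "PROTEINAS"}
model_c = {(0, 1): "NORMALIZABLES", (3, 3): "NORMALIZABLES", (5, 6): "UNCLEAR"}
print(vote([model_a, model_b, model_c]))
```

Here the span (5, 6) is dropped because only one of the three models predicts it, which is how voting can reduce spurious predictions from a single model.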
In the CI task, the ensemble approach shows the best performance, achieving an F-score of 79.97%. CSR-Attn achieved an F-score of 79.95% as the best individual span representation. The pipeline approach may not be an ideal solution for the concept indexing task, because wrong predictions from the NER module propagate to the second step. Table 2 shows the per-category performance of the NER ensemble on the test set, together with the numbers of predicted and correct mentions relative to the gold annotations. For the NORMALIZABLES and PROTEINAS classes, the model shows high performance because there are a reasonable number of training instances for these classes and the mentions in these two classes appeared in the same documents. In contrast, for the rare classes UNCLEAR and NO NORMALIZABLES, the performance is low. This may be partly due to their low frequency in the training set, which makes it hard for the network to learn their representations.

Ablation Study
Table 3 shows the performance of the different NER models for Sub-tasks 1 and 2 on the development set, to compare the possible configurations and to report the best system submissions for NER and CI. The results show that almost all the approaches perform similarly on both sub-tasks. The top four models (i.e., CSR-Attn, CLM-Attn, CSR-Avg, and BLM-Attn) and the ensemble of eight models were selected for test evaluation. For the single NER models, the results in Table 3 show that attention performs better than averaging when the other settings are the same. LSTM-Minus helps when there is no contextual information, but it does not help when there is contextual information.
In the CI task on the development set, the ensemble approach shows the best performance, achieving an F-score of 77.36%. CLM-Attn achieved an F-score of 77.20% as the best individual span representation. Table 4 shows the per-category performance of the NER ensemble on the development set. On the development set, the model appears to generalize well across all classes, including rare classes such as UNCLEAR and NO NORMALIZABLES. The F-scores of NORMALIZABLES and PROTEINAS drop marginally from the development to the test set, by 1.22% and 4.33%, respectively. Surprisingly, however, the F-scores of the rare classes UNCLEAR and NO NORMALIZABLES drop significantly, by 14.16% and 51.05%, respectively, which affects the overall F-score on the test set. We leave this analysis for future work.

Conclusion
This paper presented a pipeline approach that integrates contextual neural exhaustive models, which capture the surrounding context of a target span, and non-contextual ones, both of which consider all possible spans exhaustively, for named entity recognition (NER), together with dictionary- and similarity score-based matching for concept indexing (CI), without depending on any external NLP tools. The proposed contextual exhaustive model can detect flat and nested entities among the mention candidates generated from all possible spans. The model obtains the representation of each span using the outputs of the underlying shared bidirectional LSTM layer, and it represents each span by concatenating the forward- and backward-context, boundary, and inside representations of the span. Several enhancements, namely contextual span representation, average representation, attention mechanism, LSTM-Minus, and ensembling, are investigated for the representations. The model then classifies each span into an entity type or non-entity. To predict the concept unique identifier (CUI) of a mention, the system performs dictionary matching and, for a mention with no match, computes similarity scores using entity embeddings. Among the five submitted runs, the best run for each sub-task achieved an F-score of 86.76% on Sub-task 1 (NER) and an F-score of 79.97% on Sub-task 2 (CI).
In future work, we will implement a joint model that directly recognizes entity mentions and links them to concept unique identifiers in an end-to-end manner.