A Neural Layered Model for Nested Named Entity Recognition

Entity mentions embedded in longer entity mentions are referred to as nested entities. Most named entity recognition (NER) systems deal only with the flat entities and ignore the inner nested ones, which fails to capture finer-grained semantic information in underlying texts. To address this issue, we propose a novel neural model to identify nested entities by dynamically stacking flat NER layers. Each flat NER layer is based on the state-of-the-art flat NER model that captures sequential context representation with bidirectional Long Short-Term Memory (LSTM) layer and feeds it to the cascaded CRF layer. Our model merges the output of the LSTM layer in the current flat NER layer to build new representation for detected entities and subsequently feeds them into the next flat NER layer. This allows our model to extract outer entities by taking full advantage of information encoded in their corresponding inner entities, in an inside-to-outside way. Our model dynamically stacks the flat NER layers until no outer entities are extracted. Extensive evaluation shows that our dynamic model outperforms state-of-the-art feature-based systems on nested NER, achieving 74.7% and 72.2% on GENIA and ACE2005 datasets, respectively, in terms of F-score.


Introduction
The task of named entity recognition (NER) involves extracting from text the names of entities pertaining to semantic types such as person (PER), location (LOC) and geo-political entity (GPE). NER has drawn the attention of many researchers as the first step towards NLP applications such as entity linking (Gupta et al., 2017), relation extraction (Miwa and Bansal, 2016), event
1 Code is available at https://github.com/meizhiju/layered-bilstm-crf
(Walker et al., 2006) containing the nested 4 entities nested 3 levels deep.
Due to the properties of natural language, many named entities contain nested entities: embedded names which are included in other entities, as illustrated in Figure 1. This phenomenon is quite common in many domains (Alex et al., 2007; Byrne, 2007; Wang, 2009; Màrquez et al., 2007). However, much of the work on NER copes only with non-nested entities, also called flat entities, and neglects nested entities. This leads to the loss of potentially important information, with negative impacts on subsequent tasks.
Traditional approaches to NER are mainly of two types: supervised learning (Ling and Weld, 2012; Marcińczuk, 2015; Leaman and Lu, 2016) and hybrid approaches (Bhasuran et al., 2016; Rocktäschel et al., 2012; Leaman et al., 2015) that combine supervised learning with rules. Such approaches require either domain knowledge or heavy feature engineering. Recent advances in neural networks enable NER without depending on external knowledge resources by automatically learning high-level and abstract features from text (Lample et al., 2016; Ma and Hovy, 2016; Pahuja et al., 2017; Strubell et al., 2017).
In this paper, we propose a novel dynamic neural model for nested entity recognition, without relying on any external knowledge resources or linguistic features. Our model sequentially stacks flat NER layers from bottom to top and identifies entities in an end-to-end manner. The number of stacked layers depends on the level of entity nesting and dynamically adjusts to the input sequences, as the nested level varies across sequences.
Given a sequence of words, our model first represents each word using a low-dimensional vector concatenated from its corresponding word and character sequence embeddings. Taking the sequence of word representations as input, our flat NER layer captures context representation with a long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) layer. The context representation is then fed to a CRF layer for label prediction. Subsequently, the context representation from the LSTM layer is merged to build a representation for each detected entity, which is used as the input for the next flat NER layer. Our model stops detecting entities if no entities are predicted by the current flat NER layer. By stacking flat NER layers in order, we are able to extract entities from inside to outside while sharing parameters among the different LSTM layers and CRF layers.
We gain 3.9 and 9.1 percentage point improvements regarding F-score over the state-of-the-art feature-based model on two nested entity corpora: GENIA (Kim et al., 2003) and ACE2005 (Walker et al., 2006), and analyze contributions of inner entities to outer entity detection, drawing several key conclusions.
In addition, experiments are conducted on a flatly annotated corpus, JNLPBA (Kim et al., 2004). Our model can also serve as a complete NER model for flat entities, on the condition that it is trained on annotations that do not account for nested entities. We obtain an F-score of 75.55%, which is comparable to the state-of-the-art performance.

Neural Layered Model
Our nested NER model is designed as a sequential stack of flat NER layers that detects nested entities in an end-to-end manner. Figure 2 provides an overview of our model. Our flat NER layers are inspired by the state-of-the-art model proposed in Lample et al. (2016). Each layer uses a single bidirectional LSTM layer to represent word sequences and predicts flat entities by putting a single CRF layer on top of the LSTM layer. Therefore, we refer to our model as the Layered-BiLSTM-CRF model. If any entities are predicted, a new flat NER layer is introduced, and the word sequence representation of each entity detected by the current flat NER layer is merged to compose a representation for the entity, which is then passed on to the new flat NER layer as its input. Otherwise, the model terminates stacking and hence finishes entity detection.
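The stacking procedure described above can be sketched as a simple loop. This is a minimal illustration rather than the released implementation: `stack_flat_layers`, `scripted_layer`, the `max_layers` safety cap and the toy entity types are all hypothetical names and values introduced here, and the trained BiLSTM-CRF layer is replaced by a stand-in that replays fixed predictions.

```python
def stack_flat_layers(inputs, flat_layer, merge, max_layers=16):
    """Apply the same flat NER layer repeatedly until it predicts no entities.

    flat_layer -- callable returning (context_representation, entities);
                  the same callable is reused at every level, which mirrors
                  parameter sharing across layers
    merge      -- callable building the next layer's input from the context
                  representation and the detected entities
    max_layers -- safety cap (an assumption of this sketch, not of the paper)
    """
    layers = []
    layer_input = inputs
    for _ in range(max_layers):
        context, entities = flat_layer(layer_input)
        if not entities:          # no outer entities found: stop stacking
            break
        layers.append(list(entities))
        layer_input = merge(context, entities)
    return layers


def scripted_layer(script):
    """Toy stand-in for a trained layer: replays fixed predictions in order."""
    remaining = list(script)

    def flat_layer(layer_input):
        entities = remaining.pop(0) if remaining else []
        return layer_input, entities

    return flat_layer


# Illustrative tokens and (start, end, type) spans, with inclusive ends.
tokens = ["interleukin-2", "receptor", "alpha", "gene"]
layer = scripted_layer([[(0, 0, "protein")], [(0, 3, "DNA")]])
predicted = stack_flat_layers(tokens, layer, merge=lambda ctx, ents: ctx)
```

The number of iterations is decided by the predictions themselves, so a sentence without nested entities costs two passes (one that finds the flat entities and one that finds nothing).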
In this section, we provide a brief description of the model architecture: the flat NER layers and their stacking, the embedding layer and their training.

Flat NER layer
A flat NER layer consists of an LSTM layer and a CRF layer. The LSTM layer captures the bidirectional context representation of sequences and subsequently feeds it to the CRF layer to globally decode label sequences.
LSTM is a variant of recurrent neural networks (RNNs) (Goller and Kuchler, 1996) that incorporates a memory cell to remember the past information for a long period of time. This enables capturing long dependencies, thus reducing the gradient vanishing/explosion problem existing in RNNs. We employ bidirectional LSTM with no peephole connection. We refer the readers to Hochreiter and Schmidhuber (1997) for more details of LSTM used in our work.
CRFs are used to globally predict label sequences for any given sequences. Given an input sequence $X = (x_1, x_2, \ldots, x_n)$, which is the output from the LSTM layer, we maximize the log-probability of the correct label sequence during training. In decoding, we set the costs of illegal transitions, e.g., the transition from O to I-PER, to infinity in order to rule out illegal label sequences. The predicted label sequence $y = (y_1, y_2, \ldots, y_n)$ is the one with the maximum score in decoding.
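Decoding with forbidden transitions can be illustrated with a small Viterbi routine. This is a sketch under stated assumptions: scores are maximized, so an "infinite cost" is written as a score of negative infinity; the function names `illegal` and `viterbi`, the three-label tag set, and the rule that a sequence may not start with an I- label are choices made for this example and are not taken from the paper's code.

```python
import math


def illegal(prev, curr):
    """BIO constraint: I-X may only follow B-X or I-X."""
    if not curr.startswith("I-"):
        return False
    return prev not in ("B-" + curr[2:], "I-" + curr[2:])


def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence that obeys the BIO constraint.

    emissions   -- one list of per-label scores for each word
    transitions -- transitions[p][c] is the score of moving from label p to c
    """
    n = len(labels)
    # Mask illegal transitions so that no best path can use them.
    trans = [[-math.inf if illegal(labels[p], labels[c]) else transitions[p][c]
              for c in range(n)] for p in range(n)]
    # Assumption of this sketch: a sequence cannot start with an I- label.
    score = [-math.inf if labels[c].startswith("I-") else emissions[0][c]
             for c in range(n)]
    back = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for c in range(n):
            best_p = max(range(n), key=lambda p: score[p] + trans[p][c])
            new_score.append(score[best_p] + trans[best_p][c] + emit[c])
            pointers.append(best_p)
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(n), key=lambda c: score[c])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[i] for i in path]


labels = ["O", "B-PER", "I-PER"]
zero_transitions = [[0.0] * 3 for _ in range(3)]
# I-PER has the highest emission score for both words, but is not allowed
# at the start of the sequence or directly after O.
emissions = [[1.0, 0.5, 3.0], [0.2, 0.1, 2.0]]
decoded = viterbi(emissions, zero_transitions, labels)
```

Without the mask the decoder would pick I-PER for both words; with it, the best legal path opens the entity with B-PER.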

Stacking flat NER layers
We stack a flat NER layer on top of the current flat NER layer, aiming to extract outer entities. Concretely, we merge and average the current context representation of the regions composed in the detected entities, as described in the following equation:

$$m_i = \frac{1}{\mathrm{end} - \mathrm{start} + 1} \sum_{i=\mathrm{start}}^{\mathrm{end}} z_i \qquad (1)$$

where $z_i$ denotes the representation of the $i$-th word from the flat NER layer, and $m_i$ is the merged representation for an entity. The region starts from a position start and ends at a position end of the sequence. This merged representation of detected entities allows us to treat each detected entity as a single token, and hence we are able to make the most of inner entity information to encourage outer entity recognition. If the region is detected as a non-entity, we keep the representation without any processing. The processed context representation of the flat NER layer is used as the input for the next flat NER layer.
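The merge-and-average step can be sketched in a few lines of plain Python. The sketch makes one assumption that the text leaves open: the sequence length is kept, and every word inside a detected region receives the averaged vector (the text could also be read as collapsing the region into a single position). The function name `merge_entity_regions` and the toy vectors are hypothetical.

```python
def merge_entity_regions(z, entities):
    """Replace the vectors of each detected entity region by their average.

    z        -- context representations, one vector (list of floats) per word
    entities -- (start, end) word spans with inclusive ends, assumed not to
                overlap because a flat layer predicts flat entities
    Words outside any detected entity keep their representation unchanged.
    """
    merged = [list(v) for v in z]
    for start, end in entities:
        region = z[start:end + 1]
        dim = len(region[0])
        mean = [sum(v[d] for v in region) / len(region) for d in range(dim)]
        for i in range(start, end + 1):
            merged[i] = list(mean)
    return merged


z = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
merged = merge_entity_regions(z, [(0, 1)])
```

Here the first two words form a detected entity and share the averaged vector, while the third word is passed through untouched.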

Embedding layer
The input for the first NER layer is different from the remaining flat NER layers since the first layer has no previous layers. We thus represent each word by concatenating character sequence embeddings and word embeddings for the first flat NER layer. Figure 3 describes the architecture of the embedding layer to produce word representation. Following the successes of Ma and Hovy (2016) and Lample et al. (2016) in utilizing character embeddings on the flat NER task, we also represent each word with its character sequence to capture the orthographic and morphological features of the word. Each character is mapped to a randomly initialized vector through a character lookup table. We feed the character vectors comprising a word to a bidirectional LSTM layer and concatenate the forward and backward representation to obtain the word-level embedding.
Unlike the character sequence embeddings, the word embeddings are initialized with pretrained word embeddings. When evaluating or applying the model, words that are outside of the pretrained embeddings and training dataset are mapped to an unknown (UNK) embedding, which is randomly initialized during training. To train the UNK embedding, we replace words whose frequency is 1 in the training dataset with the UNK embedding with a probability of 0.5.
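The singleton replacement used to train the UNK embedding can be sketched as follows. The token string `<UNK>`, the function name `replace_singletons` and the optional `rng` argument are assumptions of this example; only the frequency threshold of 1 and the probability of 0.5 come from the text.

```python
import random
from collections import Counter

UNK = "<UNK>"  # hypothetical placeholder token


def replace_singletons(sentences, p=0.5, rng=None):
    """Replace words seen exactly once in the training data by UNK with probability p."""
    if rng is None:
        rng = random.Random()
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[UNK if counts[word] == 1 and rng.random() < p else word
             for word in sentence]
            for sentence in sentences]


train = [["the", "cat"], ["the", "dog"]]
always = replace_singletons(train, p=1.0)   # every singleton is replaced
never = replace_singletons(train, p=0.0)    # nothing is replaced
```

In practice the replacement would be re-sampled at each pass over the data, so that a rare word is sometimes seen as itself and sometimes as UNK.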

Training
We prepare the gold labels based on the conventional BIO (Beginning, Inside, Out of entities) tagging scheme to represent a label attached to each word.
As our model detects entities from inside to outside, we keep the same order in preparing the gold labels for each word sequence. We call this the detection order rule. Meanwhile, we require that each entity region in the sequence be tagged only once with the same entity type, referred to as the non-duplicate rule. For instance, in Figure 2, "interleukin-2" is tagged first while "interleukin-2 receptor alpha gene" is subsequently tagged following the above two rules. When assigning the label O to non-entity regions, we only follow the detection order rule. With these rules, the number of labels for each word equals the nested level of entities in the given word sequence.
We employ mini-batch training and update the model parameters using back-propagation through time (BPTT) (Werbos, 1990) with Adam (Kingma and Ba, 2014). The model parameters include weights, biases, transition costs, and embeddings of characters. We disable updating the word embeddings. During training, early stopping, L2 regularization and dropout (Hinton et al., 2012) are used to prevent overfitting. Dropout is applied to the input of each flat NER layer. Hyper-parameters including batch size, number of hidden units in LSTM, character dimensions, dropout rate, Adam learning rate, gradient clipping and weight decay (L2) are all tuned with Bayesian optimization (Snoek et al., 2012).
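One plausible reading of the two rules is sketched below: each entity is placed on the lowest layer that lies above every entity contained in it, and each (span, type) pair is written only once. The function name `layered_bio_labels`, the assumption that entities are properly nested (no crossing spans), and the entity types in the example are illustrative and are not taken from the paper's preprocessing code.

```python
def layered_bio_labels(n_words, entities):
    """Build one BIO label row per nesting level, innermost level first.

    entities -- (start, end, type) spans with inclusive ends, assumed to be
                properly nested (spans may contain each other but never cross)
    """
    # Shorter spans first, so inner entities are placed before outer ones.
    # set() drops exact duplicates (non-duplicate rule).
    ordered = sorted(set(entities), key=lambda e: (e[1] - e[0], e[0], e[2]))
    placed = []   # (start, end, level) of entities already written
    layers = []
    for start, end, etype in ordered:
        inner = [lvl for s, e, lvl in placed if start <= s and e <= end]
        level = 1 + max(inner, default=0)     # detection order rule
        placed.append((start, end, level))
        while len(layers) < level:
            layers.append(["O"] * n_words)
        row = layers[level - 1]
        row[start] = "B-" + etype
        for i in range(start + 1, end + 1):
            row[i] = "I-" + etype
    return layers


# Tokens: interleukin-2 / receptor / alpha / gene (types are illustrative).
gold = layered_bio_labels(4, [(0, 3, "DNA"), (0, 0, "protein")])
```

Under this reading the number of label rows equals the deepest nesting level in the sentence, which matches the statement about the number of labels per word.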

Evaluation Settings
We employed three datasets for evaluation: GENIA (Kim et al., 2003), ACE2005 (Walker et al., 2006) and JNLPBA (Kim et al., 2004). We briefly explain the data and task settings and then introduce model and experimental settings.

Data and Task Settings
We performed nested entity extraction experiments on GENIA and ACE2005, while we conducted flat entity extraction on the JNLPBA dataset. For the details of data statistics and preprocessing, please refer to the supplementary materials.
GENIA involves 36 fine-grained entity categories among a total of 2,000 MEDLINE abstracts. Following the same task settings as in Finkel and Manning (2009) and Lu and Roth (2015), we collapsed all DNA subcategories as DNA. The same setting was applied to the RNA, protein, cell line and cell type categories. We used the same test portion as Finkel and Manning (2009), Lu and Roth (2015) and Muis and Lu (2017) for direct comparison.
ACE2005 contains 7 fine-grained entity categories. We made the same modifications described in Lu and Roth (2015) and Muis and Lu (2017) by keeping files from bn, bw, nw and wl and splitting them at random into training, development and testing datasets with the same ratio of 8:1:1, respectively.
JNLPBA defines both training and testing datasets. These two datasets are composed of 2,000 and 404 MEDLINE abstracts, respectively. JNLPBA is originally from the GENIA corpus. However, only flat and topmost entities in JNLPBA are kept while nested and discontinuous entities are removed. Like our preprocessing on the GENIA corpus, subcategories are collapsed and only 5 entity types are finally retained. We randomly chose 90% of the sentences of the original training dataset as our training dataset and the remainder as our development dataset.
Precision (P), recall (R) and F-score (F) were used for the evaluation metrics in our tasks. We define that if the numbers of gold entities and predictions are all zeros, the evaluation metrics all equal one hundred percent.
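The metric convention above, including the special case of no gold entities and no predictions, can be written out as a short function. The sketch assumes that a prediction counts as correct only when both its span and its type match a gold entity exactly; the function name `prf` and the choice to return 0 when only one denominator is zero are assumptions of this example.

```python
def prf(gold, pred):
    """Precision, recall and F-score in percent over sets of (start, end, type)."""
    gold, pred = set(gold), set(pred)
    if not gold and not pred:
        # Convention stated in the text: nothing to find and nothing predicted.
        return 100.0, 100.0, 100.0
    tp = len(gold & pred)
    p = 100.0 * tp / len(pred) if pred else 0.0
    r = 100.0 * tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = {(0, 0, "protein"), (0, 3, "DNA")}
pred = {(0, 0, "protein"), (1, 3, "DNA")}   # second span has a wrong boundary
scores = prf(gold, pred)
```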

Model and Experimental Settings
Our model was implemented with Chainer (Tokui et al., 2015). We initialized word embeddings in GENIA and JNLPBA with the pretrained embeddings trained on MEDLINE abstracts (Chiu et al., 2016). For ACE2005, we initialized each word with the pretrained embeddings trained by Miwa and Bansal (2016). All parameters other than the word embeddings were initialized with a normal distribution. For LSTM, we initialized hidden states, cell state and all the bias terms as 0, except for the forget gate bias, which was set to 1. For other hyper-parameters, we chose the best values via Bayesian optimization. We refer the readers to the supplemental material for the settings of the hyper-parameters of the models and Bayesian optimization.
For ablation tests, we compared our layered-BiLSTM-CRF model with two models that produce the input for the next flat NER layer in different ways. The first model, called layered-BiLSTM-CRF w/o layered out-of-entities, uses the input of the current flat NER layer for out-of-entity words. The second model, called layered-BiLSTM-CRF w/o layered LSTM, skips all intermediate LSTM layers and only uses the output of the embedding layer to build the input for the next flat NER layer. Please refer to the supplemental material for these two models (see footnote 7).
To investigate the effectiveness of our model on different nested levels of entities, we evaluated the model performance on each flat NER layer on the GENIA and ACE2005 test datasets (see footnote 8). When calculating precision and recall, we collected the predictions and gold entities from the corresponding flat NER layer. Since predicted entities on a specific flat NER layer might be from other flat NER layers, we defined extended precision (EP), extended recall (ER) and extended F-score (EF) to measure the performance. We calculated EP by comparing the predicted entities in a specific flat NER layer with all the gold entities, and ER by comparing the gold entities in a specific flat NER layer with all the predicted entities. EF was calculated in the same way as F.
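The extended metrics can be sketched as follows. The function name `extended_prf`, the exact-match criterion on (start, end, type), and the value 0 for an empty denominator are assumptions of this example; the removal of entities already predicted in earlier layers, mentioned in the footnote, is not modeled here.

```python
def extended_prf(gold_by_layer, pred_by_layer, layer):
    """Extended precision, recall and F-score (percent) for one flat NER layer.

    EP compares this layer's predictions with the gold entities of all layers;
    ER compares this layer's gold entities with the predictions of all layers.
    """
    all_gold = set().union(*gold_by_layer)
    all_pred = set().union(*pred_by_layer)
    pred = set(pred_by_layer[layer])
    gold = set(gold_by_layer[layer])
    ep = 100.0 * len(pred & all_gold) / len(pred) if pred else 0.0
    er = 100.0 * len(gold & all_pred) / len(gold) if gold else 0.0
    ef = 2 * ep * er / (ep + er) if ep + er else 0.0
    return ep, er, ef


# Both entities are found, but each on the "wrong" layer.
gold_by_layer = [{(0, 0, "protein")}, {(0, 3, "DNA")}]
pred_by_layer = [{(0, 3, "DNA")}, {(0, 0, "protein")}]
first_layer = extended_prf(gold_by_layer, pred_by_layer, 0)
```

In this toy case the standard per-layer precision and recall would both be zero, whereas the extended metrics credit the correct entities regardless of the layer on which they were predicted.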
In addition to the experiments on the nested GENIA and ACE2005 datasets, flat entity recognition was conducted on the JNLPBA dataset. We trained a flat model that kept only the first flat NER layer and removed the subsequent stacked layers. We followed the hyper-parameter settings of Lample et al. (2016) for this evaluation.
7 We examined the contributions of the predicted labels of the current flat NER layer to the next flat NER layer. For this, we introduced label embeddings into each test by combining the embedding with the context representation. Experiments show that appending label embeddings hurts the performance of our model while yielding slight improvements in the other two models on the development datasets.
8 We removed entities which were predicted in previous flat NER layers during evaluation.
Results and Analysis

Nested NER
Table 1 presents the comparisons of our model with related work, including the state-of-the-art feature-based model by Muis and Lu (2017). Our model outperforms the state-of-the-art models with F-scores of 74.7% and 72.2%, achieving the new state of the art on the nested NER tasks. For GENIA, our model gained more improvement in recall by extracting more nested entities without reducing precision. On ACE2005, we improved recall by 12.2 percentage points and obtained a 5.1% relative error reduction. Compared with GENIA, our model gained more improvement in F-score on ACE2005. Two possible reasons account for this. One is that ACE2005 contains more deeply nested entities (maximum nested level 5) than GENIA (maximum nested level 3) in the test dataset, which allows our model to capture the potential 'nested' relations among nested entities. The other is that ACE2005 has a higher proportion of nested entities (37.45%) than GENIA (21.56%).
Table 2 shows the results of the models on the development datasets of GENIA and ACE2005. Our model, which uses only the context representation to prepare the input for the next flat NER layer, performs better than the other two models. This demonstrates that reusing the input of the current flat NER layer, either by skipping the representation of non-entity words or by skipping all intermediate LSTM layers, hurts performance. Compared with the layered-BiLSTM-CRF model, the performance drop of the layered-BiLSTM-CRF w/o layered out-of-entities model shows that skipping the representation of out-of-entity words leads to a decline in performance. This is because the representation of non-entity words did not incorporate the current context representation, as we used the input rather than the output to represent them.
By analogy, the layered-BiLSTM-CRF w/o layered LSTM model skips the representation for both entities and non-entity words, resulting in worse performance. This is because, when all intermediate LSTM layers are skipped, the input of the first flat NER layer, i.e., the word embeddings, is passed to the remaining flat NER layers. Since word embeddings do not contain context representation, we fail to incorporate the context representation when we use the word embeddings as the input for the flat NER layers. Therefore, we have no chance to take advantage of the context representation in this case.
Table 3 and Table 4 describe the performance for each entity type in the GENIA and ACE2005 test datasets, respectively. In GENIA, our model performed best in recognizing entities of type RNA. This is because most RNA entities end with either "mRNA" or "RNA", and these two words are informative indicators of RNA entities. For the remaining entity types, performance is close to the overall performance. One possible reason is that there are many instances with which to model them. This also accounts for the high performance on entity types such as PER and GPE in ACE2005. The small number of instances of entity types like FAC in ACE2005 is one reason for their below-overall performance. We refer readers to the supplemental material for detailed statistics.
When evaluating our model on the top level, which contains only outermost entities, the precision, recall and F-score were 78.19%, 75.17% and 76.65% on the GENIA test dataset. For ACE2005, the corresponding precision, recall and F-score were 68.37%, 68.57% and 68.47%. Compared with the overall performance listed in Table 1, we obtained higher top-level performance on GENIA but lower performance on ACE2005. We discuss the details of this phenomenon with the following tables.
Table 5 shows the performance of each flat NER layer on the GENIA test dataset. Among all the stacked flat NER layers, our model achieved the best performance in terms of the standard evaluation metrics on the first flat NER layer, which contains the predictions for the gold innermost entities. When the model went to deeper flat NER layers, the performance dropped gradually as the number of gold entities decreased. However, the performance of the predictions on each flat NER layer was different in terms of the extended evaluation metrics. For the first two flat NER layers, the performance under extended evaluation is better than under standard evaluation. This indicates that the gold entities corresponding to some of the predictions on a specific flat NER layer are from other flat NER layers. This may lead to the zero performance on the last flat NER layer. In addition, performance on the second flat NER layer was higher than on the first flat NER layer in terms of extended F-score. This demonstrates that our model is able to obtain higher performance on top-level entities than on innermost entities.
Table 6 lists the results of each flat NER layer on the ACE2005 test dataset. Similar to GENIA, the first flat NER layer achieved better performance than the remaining flat NER layers. Performance decreased from the bottom to the top of the model architecture. This phenomenon also held for the extended evaluation, which reflects that some of the predictions in a specific flat NER layer were detected in other flat NER layers. Unlike the rising tendency (except for the last flat NER layer) of the extended F-score in GENIA, performance in ACE2005 showed a downward trend. This accounts for the fact that the F-score on the top level was lower than on the first flat NER layer.
Despite the declining trend in extended F-score, the first flat NER layer contained the largest proportion of predictions for the gold entities, so the overall performance on all nested entities shown in Table 1 was still high. Unlike in GENIA, our model in ACE2005 stopped before reaching the maximum nested level of entities. This indicates that our model failed to model the appropriate nested levels, which is one of the reasons for the zero predictions on the last flat NER layer. The sparse instances on the high nested levels could be another reason for the zero performance on the last flat NER layer.

Flat NER
Compared with the state-of-the-art work on JNLPBA (Gridach, 2017), which achieved 75.87% in F-score, our model obtained 75.55%. Since both the model by Gridach (2017) and our flat model are based on Lample et al. (2016), it is reasonable that the two models achieve comparable performance.

Error analysis
We examined the error types and their statistics both for all nested entities and for each flat NER layer on the GENIA and ACE2005 test datasets. On the ACE2005 test dataset, 28% of predictions were incorrect in 200 sentences selected at random. Among these errors, 39% occurred because the text spans were assigned other entity types. We call this type of error type error. The main reason is that most of them are pronouns that co-refer to other entities which are absent from the sentence. Taking the sentence "whether that is true now, we can not say" as an example, "we" is annotated as ORG while our model labeled it as PER. Lack of context information, such as the absence of co-referent entities, leads our model to make wrong decisions. In addition, 30% of the errors were predictions that covered only parts of gold entities, with correct entity types. This error type is referred to as partial prediction error. This might be because these gold entities tend to be clauses or independent sentences, thus possibly containing many modifiers. For example, in the sentence "A man who has been to Baghdad many times and can tell us with great knowledge exactly what it's going to be like to fight on those avenues in that sprawling city of Baghdad -Judy .", "A man who has been to Baghdad many times and can tell us with great knowledge exactly what it's going to be like to fight on those avenues in that sprawling city of Baghdad" is annotated as PER while our model could only extract "A man who has been to Baghdad many times" and predicted it as PER.
On the first flat NER layer, 41% of the errors were type errors and 11% were partial prediction errors. Apart from these, our model recognized predictions from other flat NER layers, leading to 5% of the errors. We define this error type as layer error. Unlike the first flat NER layer, 26% of errors were caused by layer error. Additionally, 17% of the errors belong to type error. In particular, 22% of the errors were due to type error. As for the last flat NER layer, 40% of the errors were caused by partial prediction error. The remaining errors were different from the mentioned error types. One possible reason is that we have fewer gold entities to train this flat NER layer compared with previous flat NER layers. Another reason might be error propagation.
Similarly, 200 sentences were randomly selected from the GENIA test dataset. In this subset, 20% of the predictions were errors. Among these errors, 17% and 24% were due to type error and partial prediction error, respectively. In addition, 24% of the predictions on the first flat NER layer were incorrect. Among them, the top error types were layer error, partial prediction error and type error, accounting for 21%, 18% and 13%, respectively. Errors on the second flat NER layer were mainly caused by type error and partial prediction error.

Related Work
The success of neural networks has boosted the performance of flat NER in different domains (Lample et al., 2016; Ma and Hovy, 2016; Gridach, 2017; Strubell et al., 2017). Such models achieved the state of the art without any handcrafted features or external knowledge resources.
In contrast to flat NER, far fewer attempts have addressed nested entity recognition. Existing approaches to nested NER (Shen et al., 2003; Alex et al., 2007; Finkel and Manning, 2009; Lu and Roth, 2015; Xu and Jiang, 2016; Muis and Lu, 2017) mainly rely on hand-crafted features. They also failed to take advantage of the dependencies among nested entities. Our model captures these dependencies and automatically learns high-level abstract features from texts.
Early work on nested NER mainly involved hybrid systems that combined rules with supervised learning algorithms. For example, Shen et al. (2003),  and  employed a Hidden Markov Model on GENIA to extract inner entities and then used rule-based methods to obtain the outer entities. Furthermore, Gu (2006) extracted nested entities based on SVMs which were trained separately on inner entities and outermost entities without taking the hidden relations between nested entities into consideration. All these methods failed to capture the dependencies between nested entities. In one attempt, Alex et al. (2007) separately built inside-out and outside-in layered CRFs which were able to use the current guesses as the input for the next layer. They also cascaded separate CRFs for each entity type by using the output from previous CRFs as features of the current CRFs, yielding the best performance in their work. One of the main drawbacks of the cascading approach was that it failed to handle nested entities sharing the same entity type, which are quite common in natural languages. Finkel and Manning (2009) proposed a discriminative constituency tree to represent each sentence, where the root node was used for connection. All entities were treated as phrases and represented as subtrees following the whole tree structure. Unlike our model, which is independent of linguistic features, Finkel and Manning (2009) used a CRF-based approach driven by entity-level features to detect nested entities. Later on, Lu and Roth (2015) built hyper-graphs that allow edges to connect multiple nodes to represent both the nested entities and their references (a.k.a. mentions). One issue in their approach is the spurious structures of hyper-graphs, as they enumerate combinations of nodes, types and boundaries to represent entities. In addition, they fail to encode the dependencies among embedded entities using hyper-graphs.
In contrast, our model enables nested entity representation by merging representation of multiple tokens composed in the entity and considers it as the longer entity representation. This allows us to represent outer entities based on inner entity representation, thus managing to capture the relations between inner and outer entities, and hence overcoming the spurious entity structure problem.
As an improvement in overcoming spurious structure issue in Lu and Roth (2015), Muis and Lu (2017) further incorporated mention separators along with features to yield better performance on nested entities. Both Lu and Roth (2015) and Muis and Lu (2017) rely on hand-crafted features to extract nested entities without incorporating hidden dependencies in nested entities. In contrast, we make the most of dependencies of nested entities in our model to encourage outer entity recognition by automatic learning of high-level and abstract features from sequences.
Shared tasks dealing with nested entities, such as SemEval-2007 Task 9 and GermEval-2014, were held in order to advance the state of the art on this issue. Additionally, as subtasks in KBP 2015 and KBP 2016, one of the aims of the tri-lingual Entity Discovery and Linking (EDL) track was extracting nested entities from textual documents in English, Chinese and Spanish. Following this task, Xu and Jiang (2016) first developed a new tagging scheme based on the fixed-size ordinally-forgetting encoding (FOFE) method for text fragment representation. All the entities along with their contexts were represented using this novel tagging scheme. Unlike the LSTM-RNNs extensively used in sequence labeling tasks, a feed-forward neural network was used to predict labels at the entity level for each fragment in any given sequence. Additionally, Li et al. (2017) used the model proposed in Lample et al. (2016) to extract both flat entities and the components of nested and discontinuous entities. Another BiLSTM was applied to combine the components to obtain nested and discontinuous entities. However, these methods failed to capture and utilize the inner entity representation to facilitate outer entity detection.

Conclusion
This paper presented a dynamic layered model which takes full advantage of inner entity information to encourage outer entity recognition in an end-to-end manner. Our model is based on a flat NER layer consisting of LSTM and CRF, so it is able to capture the context representation of input sequences and globally decode predicted labels at a flat NER layer without relying on feature engineering. Our model automatically stacks the flat NER layers while sharing the parameters of LSTM and CRF among the layers. The stacking continues until the current flat NER layer predicts the whole sequence as outside of entities, which allows the dynamically stacked flat NER layers to stop. Each flat NER layer receives the merged context representation as input for outer entity recognition, based on the predicted entities from the previous flat NER layer. With this dynamic end-to-end design, our model is able to outperform existing models, achieving the state of the art on two nested NER tasks. In addition, the model can be flexibly simplified to a flat NER model by removing the components cascaded after the first NER layer.
Extensive evaluation shows that the utilization of inner entities significantly encourages outer entity detection, with improvements of 3.9 and 9.1 percentage points in F-score on GENIA and ACE2005, respectively. Additionally, using only the current context representation contributes more to the performance improvement than using context representation from multiple layers.

Hyper Parameters          Initialized Value
Acquisition Function      gp hedge
n-calls                   10
n random state            None
n random starts           10
Acquisition Optimizer     lbfgs
n restarts optimizer      100
noise                     gaussian
n points                  50000
xi                        0.1
n jobs                    1

is a connection indicator of the separated texts in discontinuous entities. Meanwhile, each of the separated texts has no 'sem' attribute unless it is itself an innermost entity. Unfortunately, there are some inconsistent cases such as "c-fos and c-jun transcripts", where the symbol * should be in the 'lex' attribute, as the discontinuous entity "c-fos transcript" is connected by "c-fos" and "transcript" while "c-jun transcript" is connected by "c-jun" and "transcript". These two entities share the same text "transcript". However, each of them is annotated with two attributes, 'lex' and 'sem', following the same annotation as for flat entities. Although it is possible to ignore the latter entity based on the 'lex' attribute and the sentence it belongs to, this rule fails to deal with the entity "c-jun gene" in the example of "c-fos and c-jun genes", as the 'lex' of "c-jun gene" is mistaken as "c-jun genes". Therefore, in this case, we ignored "c-fos transcript" and instead kept "c-jun transcripts" as a flat entity.
Another issue is incomplete tokenization. The label assignment to a word was conducted at the word level instead of the character level, but there are entities that correspond to parts of words. An example is "NF-YA subunit", which contains two protein entities: "NF-Y" and "A subunit". To cope with this problem, we treat both entities as false negative entities in the training dataset, as there are only 13 such entities in the training dataset.

B Bayesian Optimization Setting
The hyper-parameters which were tuned for our model are listed in Table 10 and Table 11. These hyper-parameters are tuned by Bayesian optimization with the hyper-parameters listed in Table 12.

C Model Structure
Figure 4 shows the model architecture when we skip all intermediate LSTM layers and only word embeddings are used to produce the input for the next flat NER layer. Figure 5 describes the model architecture when we skip the representation of non-entity words to prepare the input for the next flat NER layer. Concretely, we merge and average representation following Equation 1. For the predicted non-entity words, however, we skip the LSTM layer and directly use their corresponding representation from the input rather than the output context representation.
Figure 4: Overview of the model architecture with skipping representation for non-entity words. "interleukin-2" and "interleukin-2 receptor alpha gene" are nested entities.
Figure 5: Overview of the model architecture with skipping representation for whole sequence. "interleukin-2" and "interleukin-2 receptor alpha gene" are nested entities.