NITE: A Neural Inductive Teaching Framework for Domain Specific NER

In domain-specific NER, deep models usually perform poorly because labeled training data are insufficient. In this paper, we propose a novel Neural Inductive TEaching framework (NITE) that transfers knowledge from existing domain-specific NER models into an arbitrary deep neural network in a teacher-student training manner. NITE is a general framework built upon transfer learning and multiple instance learning, which together not only transfer knowledge to a deep student network but also reduce the noise from teachers. NITE helps deep learning methods make effective use of existing resources (i.e., models, labeled data, and unlabeled data) in a small domain. Experimental results on disease NER show that, without using any labeled data, NITE significantly boosts the performance of a CNN-bidirectional LSTM-CRF NER neural network, by around 30% in terms of F1-score.


Introduction
Domain-specific Named Entity Recognition, which aims to identify domain-specific entity mentions and their categories, plays an important role in domain document classification, retrieval, and content analysis. It is also a foundation for more complex information extraction tasks and serves as a cornerstone in the knowledge computing process of transforming data into machine-readable knowledge. Domain-specific NER is a challenging problem. For example, in the biomedical domain, the number of unseen biomedical entity mentions (such as disease names and chemical names), their abbreviations or acronyms, and multiple names for the same entity is growing fast with the rapid increase of biomedical literature and clinical records. However, the performance of a learning-based NER system relies heavily on data annotation, which is quite expensive. The situation is even worse for domain-specific NER systems, since their data annotation requires the engagement of domain experts. As a result, in many specialized domains only trained models or APIs are available, while their training data are private and inaccessible. Furthermore, because labeled training data are insufficient, deep models usually perform poorly in such domains, and the state of the art is usually dominated by rule-based deductive methods or shallow models with hand-crafted features. The knowledge needed to define useful domain-specific hand-crafted features or rules, however, is usually unavailable to the public.
In this paper, we propose a novel Neural Inductive TEaching framework (NITE) to transfer knowledge from existing models into an arbitrary deep neural network. The idea of NITE is mainly borrowed from transfer learning (Pan and Yang, 2010), where previously learned knowledge helps to solve new problems with better solutions. In NITE, existing NER models behave like inefficient teachers that teach a deep neural network (which we call the student network) to identify named entities by giving it concrete examples. The knowledge transferred from these models is their posterior distributions on unlabeled data. These teachers are inefficient because they transfer not only useful information but also errors to the student. The inputs of the student network are twofold: a small proportion comes from human-labeled ground truth data (optional, like a textbook), and a large proportion comes from the teachers, which is always noisy and less trustworthy.
In such a case, a student is overwhelmed and is often inferior to its teachers. Therefore, in NITE we introduce a Multiple Instance Learning (MIL) technique (Dietterich et al., 1997; Babenko, 2008) to reduce the input noise during model training.
In summary, NITE is a general framework that helps deep learning methods make the best use of existing resources (i.e., models, labeled data, and unlabeled data). The experiment results on Disease NER (DNER) show that, without using any labeled data, NITE boosts the F1-score of a CNN-bidirectional LSTM-CRF NER neural network (Ma and Hovy, 2016) by around 30% compared with the same network trained on the NCBI training dataset. It also outperformed the teacher model, which supports our hypothesis.

Neural Inductive Teaching Framework
In this section we will define our NITE framework step by step, and apply it to Disease NER.

Inductive Teaching
Inductive teaching means teaching a student by examples. Our inductive teaching method builds upon teacher-student models (Ba and Caruana, 2014) and knowledge distillation (Hinton et al., 2015). The main idea of our method is to transfer discriminative knowledge from well-trained existing models (teachers) to a new and more capable model (student). The student learns by imitating the teachers' behaviors, and the teaching process can be defined as follows. Let $x = \{w_1, w_2, \ldots, w_{|x|}\}$ be an input sentence of $|x|$ words, where $w_k$ is the $k$-th word in $x$. If $l_k$ is the corresponding 3-dimensional one-hot IOB (Inside-Outside-Beginning) vector for $w_k$, then the NER labeling sequence of $x$ can be defined as $y = \{l_1, l_2, \ldots, l_{|x|}\}$. For a given sentence $x_i$, we further define the posterior distribution of a teacher as $y^{f_t}_i = f_t(y_i \mid x_i)$, while the posterior distribution of a student network is defined as $y^{f_s}_i = f_s(y_i \mid x_i; \theta)$, where $\theta$ denotes the parameters of the student network. During training, we measure the similarity between $y^{f_t}$ and $y^{f_s}$ with the KL-divergence and minimize their difference. Therefore, for a given $x_i$, we optimize:
$$\theta^* = \arg\min_{\theta} \mathrm{KL}\left(y^{f_t}_i \,\|\, y^{f_s}_i\right) \qquad (1)$$
Eq. 1 can be optimized through stochastic gradient descent over shuffled mini-batches with the Adadelta (Zeiler, 2012) update rule.
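The teaching objective can be illustrated with a minimal sketch in plain Python. It assumes, purely for illustration, that both posteriors factorize into one 3-way IOB distribution per word, that the tag order is (I, O, B), and that the divergence runs from teacher to student; the helper names `one_hot`, `kl_divergence`, and `teaching_loss` are hypothetical and not part of NITE.

```python
import math

# Assumed order of the 3-dimensional IOB label vector (illustrative only).
IOB_TAGS = ("I", "O", "B")


def one_hot(tag):
    """Return the 3-dimensional one-hot IOB vector l_k for a tag."""
    return [1.0 if t == tag else 0.0 for t in IOB_TAGS]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same tags.

    `eps` is a small smoothing constant (an assumption) that avoids log(0).
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def teaching_loss(teacher_post, student_post):
    """Sum of per-word KL(teacher || student) over one sentence."""
    if len(teacher_post) != len(student_post):
        raise ValueError("teacher and student must label the same words")
    return sum(kl_divergence(t, s) for t, s in zip(teacher_post, student_post))


# Hypothetical posteriors for a three-word sentence.
teacher = [one_hot("O"), [0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]
student = [[0.2, 0.7, 0.1], [0.3, 0.3, 0.4], [0.4, 0.4, 0.2]]
loss = teaching_loss(teacher, student)
print(f"teaching loss: {loss:.4f}")
```

The loss is zero when the student reproduces the teacher's posterior exactly and grows as the two distributions diverge, which is the quantity a gradient-based optimizer would reduce.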

Multiple Instance Learning
Multiple Instance Learning is an effective training method that alleviates the wrong-label problem when training a supervised model (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012). Instead of predicting labels for individual training samples, the objective of MIL is to predict the labels (positive or negative) of unseen bags, where each bag contains a fixed number of instances (samples). The standard MIL assumption states that a bag is positive if at least one instance in it is positive, and negative if all instances in it are negative.
MIL is generally used to train a binary classifier. To apply MIL in NITE, we redefine the label of a bag as the quality (correctness) of the samples it contains. Thus, in NITE, a bag is positively labeled if at least one instance in it is labeled correctly. Furthermore, it is inappropriate to evaluate the correctness of the IOB label (i.e., $l_k$) of each word (i.e., $w_k$) separately, since the labels in the IOB sequence $y_i$ of a sentence $x_i$ depend on one another. Therefore, we choose the sentence $x_i$ as our MIL instance, and the correctness of $x_i$ is evaluated by the likelihood that all of its words carry correct IOB tags.
Formally, our MIL is defined as follows. The training samples in a mini-batch $B$ are randomly allocated into $M$ bags, i.e., $B = \{B_1, B_2, \ldots, B_M\}$ with corresponding labels $\{z_1, z_2, \ldots, z_M\}$, where $z_m \in \{-1, 1\}$. Each bag $B_m$ contains $K$ instances, i.e., $B_m = \{x_1, x_2, \ldots, x_K\}$, where $x_i$ is a sentence with its posterior evaluation $y^{f_s}_i$. During training, if $z_m = 1$, then $B_m$ is a positive bag. In order to reduce the noise, our MIL learner selects the most correct instance $y^{f_s}_{i^*}$, i.e., the one with the maximum likelihood among all instances (sentences) in the bag $B_m$. That is,
$$P(z_m = 1 \mid B_m) = P\left(y^{f_s}_{i^*}\right) = \max_{x_i \in B_m} P\left(y^{f_s}_i\right).$$
If $z_m = -1$, then $B_m$ is a negative bag. In order to better detect such negative bags, our MIL learner should select the most violated instance for learning, which is also the instance with the maximum likelihood. Thus, the bag label $z$ (which indicates whether the sentence is labeled correctly or incorrectly) is effectively integrated out: whatever the value of $z$, MIL in NITE always selects the instance with the highest likelihood. Finally, the MIL in NITE can be summarized as:
$$i^* = \arg\max_{x_i \in B_m} P\left(y^{f_s}_i\right) \qquad (2)$$
In summary, MIL in NITE can be regarded as a mechanism for posterior selection, or as a regularization on the posterior distribution of the student network. Therefore, MIL only affects model training and does not affect the testing process.
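The bag allocation and instance selection described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the sentence-level likelihood is approximated by the product of the highest per-word tag probabilities (the student network in this paper would use its own sequence likelihood instead), the posteriors are randomly generated, and all function names are hypothetical.

```python
import math
import random


def sequence_log_likelihood(posteriors):
    """Stand-in sentence score: sum of log-probabilities of the most
    probable tag at each word (an assumption made for this sketch)."""
    return sum(math.log(max(word_dist)) for word_dist in posteriors)


def allocate_bags(batch, bag_size, rng):
    """Randomly split a mini-batch into consecutive bags of `bag_size`."""
    shuffled = list(batch)
    rng.shuffle(shuffled)
    return [shuffled[i:i + bag_size] for i in range(0, len(shuffled), bag_size)]


def select_instances(bags, score):
    """Keep, from every bag, the instance with the highest score."""
    return [max(bag, key=score) for bag in bags]


rng = random.Random(0)


def random_posteriors(n_words):
    """Hypothetical student output: one 3-way distribution per word."""
    rows = []
    for _ in range(n_words):
        raw = [rng.random() + 1e-3 for _ in range(3)]
        total = sum(raw)
        rows.append([v / total for v in raw])
    return rows


# A mini-batch of 30 (sentence id, student posteriors) pairs, bag size K = 5.
batch = [(i, random_posteriors(4)) for i in range(30)]
bags = allocate_bags(batch, bag_size=5, rng=rng)
selected = select_instances(
    bags, score=lambda inst: sequence_log_likelihood(inst[1])
)
print([sentence_id for sentence_id, _ in selected])
```

Only the selected instance of each bag would contribute to the parameter update, so a mini-batch of 30 sentences with bags of 5 yields 6 training signals per iteration.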

Teacher Model & Student Network
Theoretically, the teacher model of NITE can be any existing well-trained model, while the student network can be an arbitrary deep neural network. In this paper, we focus on domain-specific NER, and more specifically on Disease NER, which is a small but typical domain that is suffering from insufficient labeled training data.
There are many existing DNER systems, of which the best known are BANNER (Leaman et al., 2008) and DNorm (Leaman et al., 2013). BANNER is an open-source biomedical NER system implemented using conditional random fields (CRFs) (Lafferty et al., 2001). DNorm uses supervised semantic indexing, trained with pairwise learning to rank, to score the mentions returned by BANNER. DNorm can therefore be regarded as an extension of BANNER, and the whole system depends on hand-crafted features such as word spelling features and orthographic features. DNorm is the state-of-the-art DNER system, and we therefore adopt it as our teacher model.
For the student network, we look for state-of-the-art solutions in general NER. There are many studies on applying complex deep learning models to general NER and other sequence labeling tasks. Without any feature engineering, deep models have achieved performance comparable to or better than that of many traditional methods. More recently, Ma and Hovy (2016) proposed a method that stacks a CNN, a bidirectional LSTM, and a CRF to form an end-to-end deep NER model (CLC for short). CLC achieved state-of-the-art performance in general NER, and we therefore take CLC as our student network. Fig. 1 shows the overall architecture of our student network.
As shown in Fig. 1, the character-level embeddings are generated by CNN layers, concatenated with pre-trained word embeddings, and finally fed into the bidirectional LSTM layer. The bidirectional LSTM captures syntactic and semantic information from both the preceding and the following context simultaneously. Its output vectors are fed into the CRF layer for IOB sequence labeling. The CRF layer is trained by maximum conditional likelihood estimation, and its likelihood is given as follows:
$$p(y_i \mid x_i; \theta) = \frac{\prod_{k=1}^{|x_i|} \psi_k(l_{k-1}, l_k, x_i)}{\sum_{y' \in \mathcal{Y}(x_i)} \prod_{k=1}^{|x_i|} \psi_k(l'_{k-1}, l'_k, x_i)} \qquad (3)$$
where $\psi_k$ denotes the potential function at position $k$ and $\mathcal{Y}(x_i)$ denotes the set of possible label sequences for $x_i$. Eq. 3 can be solved efficiently by adopting the Viterbi algorithm.
Fig. 2 shows the whole NITE-NER training process. In each training iteration, the training samples in a mini-batch are randomly allocated into $M$ bags and then fed into the student network $f_s$. For bag $B_m$, the student network generates a posterior evaluation $y^{f_s}_i$ for each input instance $x_i \in B_m$. Then the MIL module selects the best sample $y^{f_s}_{i^*}$ from all $K$ instances according to Eq. 3 and Eq. 2. Finally, NITE retrieves the posterior evaluation $y^{f_t}_{i^*}$ from the teacher and updates $\theta$ based on Eq. 1.
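To make the sequence-level likelihood of a CRF layer concrete, the following sketch scores tag sequences with a generic linear-chain CRF in log space. It is an illustration under stated assumptions, not the student network itself: the emission and transition scores are made-up numbers, tags are indexed 0 = I, 1 = O, 2 = B, the normalizer is computed with the forward algorithm, and the most likely sequence is decoded with the Viterbi algorithm.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emissions, transitions, tags):
    """Unnormalized log-score of one tag sequence."""
    score = emissions[0][tags[0]]
    for k in range(1, len(tags)):
        score += transitions[tags[k - 1]][tags[k]] + emissions[k][tags[k]]
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed scores of all tag sequences."""
    n_tags = len(emissions[0])
    alpha = list(emissions[0])
    for k in range(1, len(emissions)):
        alpha = [
            emissions[k][j]
            + log_sum_exp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            for j in range(n_tags)
        ]
    return log_sum_exp(alpha)


def log_likelihood(emissions, transitions, tags):
    """Log of the conditional probability of `tags` given the sentence."""
    return path_score(emissions, transitions, tags) - log_partition(
        emissions, transitions
    )


def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence."""
    n_tags = len(emissions[0])
    delta = list(emissions[0])
    backpointers = []
    for k in range(1, len(emissions)):
        step, pointers = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: delta[i] + transitions[i][j])
            pointers.append(best_i)
            step.append(delta[best_i] + transitions[best_i][j] + emissions[k][j])
        delta = step
        backpointers.append(pointers)
    path = [max(range(n_tags), key=lambda j: delta[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical scores for a three-word sentence; tags: 0 = I, 1 = O, 2 = B.
emissions = [[0.2, 1.5, 0.3], [0.1, 0.4, 2.0], [1.8, 0.2, 0.1]]
# transitions[i][j] scores moving from tag i to tag j; O -> I is penalized.
transitions = [[0.5, 0.1, -0.3], [-2.0, 0.3, 0.4], [1.0, -0.2, -0.5]]

best_path = viterbi(emissions, transitions)
best_log_prob = log_likelihood(emissions, transitions, best_path)
print(best_path, best_log_prob)
```

In this sketch the per-sentence value returned by `log_likelihood` plays the role of the instance score that a MIL module could compare across the sentences of a bag.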

Experiments
In this section, we describe several experiments designed to test our hypothesis of inductive teaching and to evaluate our NITE framework.

Training Corpus
Although NITE is a supervised learning framework, the discriminative knowledge of the student network is learned indirectly from the teacher models; therefore, NITE can be trained without any labeled data.
To evaluate the effectiveness of the NITE framework, we trained two DNER models on the NCBI disease corpus (Dogan et al., 2014; Islamaj Dogan and Lu, 2012). One is the well-known DNorm model, which is the state-of-the-art method in disease NER. The other is the bidirectional LSTM-CNN-CRF NER neural network, i.e., CLC (Ma and Hovy, 2016), which has state-of-the-art performance in the general NER task. The CLC architecture also serves as our student network.
The NCBI disease corpus is a widely used corpus with disease name and related concept annotations in the biomedical research field. The corpus is an extension of the AZDC corpus (Leaman et al., 2009), which was annotated only with disease mentions. The detailed characteristics of the NCBI disease corpus, as well as how we partition the data, are shown in Table 1.

Experiment Setup
The experiment setup is as follows. Our NITE-DNER is trained without any labeled data: we randomly sampled 2,000 unlabeled abstracts of biomedical literature from PubMed as our training data. The DNorm model serves as the teacher model in the NITE framework.
In the student network, we initialized character embeddings with uniform samples from where we set the dimension d = In the training procedure, we set the initial learning rate $\eta_0 = 0.015$ with decay rate $\rho = 0.05$; the learning rate is updated as $\eta_t = \eta_0 / (1.0 + \rho n)$, where $n$ is the number of epochs. We use a fixed dropout rate of 0.5 on the CNN and on both the input and output vectors of the bidirectional LSTM to mitigate overfitting. For MIL, we set the bag size $K = 5$ with a mini-batch size of 30. We implemented the neural networks with Theano on a GeForce GTX 1080.
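As a small worked example of the learning-rate update rule, the sketch below evaluates $\eta_t = \eta_0 / (1.0 + \rho n)$ with the values reported above. The function name `learning_rate` and the choice of epochs to print are illustrative assumptions.

```python
def learning_rate(epoch, eta0=0.015, rho=0.05):
    """Decayed learning rate eta_t = eta0 / (1 + rho * n) at epoch n."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return eta0 / (1.0 + rho * epoch)


# Learning rate at every fifth epoch from 0 to 20.
schedule = [learning_rate(n) for n in range(0, 21, 5)]
for n, eta in zip(range(0, 21, 5), schedule):
    print(f"epoch {n:2d}: eta = {eta:.5f}")
```

With these settings the learning rate halves after 20 epochs, since $1 + 0.05 \times 20 = 2$.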

Results and Discussion
We evaluated all three DNER methods on the NCBI test set in terms of precision, recall, and F1-score. All measurements are based on exact matching of the locations of the extracted disease mentions in the given test sentences. The experiment results are presented in Table 2. As shown in Table 2, although the complex CLC network is the state-of-the-art method in general NER, it performs poorly in the domain-specific NER task because of insufficient labeled training data. However, with the help of our NITE framework, its performance is significantly boosted and reaches a level comparable to that of DNorm. This shows that knowledge transfer in NITE is effective and important for training a deep model for domain-specific NER.
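Exact-match evaluation of this kind can be sketched as follows. The sketch assumes that mentions are read off IOB tag sequences, that a mention counts as correct only when both of its boundaries match a gold mention, and that an "I" tag without a preceding "B" or "I" starts a new mention; the tag sequences are invented examples and the function names are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (start, end) mention spans in an IOB sequence.

    `end` is exclusive. An "I" that does not continue a mention is treated
    as the start of a new one (an assumption; other conventions exist).
    """
    spans, start = set(), None
    for k, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                spans.add((start, k))
            start = k
        elif tag == "O":
            if start is not None:
                spans.add((start, k))
            start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def exact_match_prf(gold_sentences, pred_sentences):
    """Micro-averaged precision, recall, and F1 over exact span matches."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented gold and predicted tag sequences for two sentences.
gold = [["O", "B", "I", "O", "B"], ["B", "I", "I", "O"]]
pred = [["O", "B", "I", "O", "O"], ["B", "I", "O", "O"]]
precision, recall, f1 = exact_match_prf(gold, pred)
print(precision, recall, f1)
```

In the example, the second predicted mention covers only part of the gold mention, so it counts as both a false positive and a missed mention under exact matching.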

Conclusion
In this paper, we proposed a general framework, NITE, and demonstrated its effectiveness in transferring DNER knowledge into an end-to-end deep NER model. Although we only presented a solution for DNER, it could easily be applied to other domain-specific NER problems (e.g., chemical, gene, and protein) or even to applications other than NER. The experiment results suggest that NITE can be very helpful for training a deep model when other resources are available. For future work, a NITE architecture with more than one teacher could be considered. Moreover, as mentioned in (Zhou et al., 2017), crowd knowledge can be used to reshape deep learning features. Our framework can also incorporate crowd knowledge easily: the teachers can be human crowds, and NITE can then employ active learning (Olsson, 2009) or lifelong machine learning (Chen and Liu, 2016) to progressively polish the student model.