Connecting Distant Entities with Induction through Conditional Random Fields for Named Entity Recognition: Precursor-Induced CRF

This paper presents a method of designing a specific high-order dependency factor on the linear-chain conditional random field (CRF) for named entity recognition (NER). Named entities tend to be separated from each other by multiple outside tokens in a text, and thus the first-order CRF, as well as the second-order CRF, may innately lose transition information between distant named entities. The proposed design uses the outside label in NER as a transmission medium for preceding entity information on the CRF. Empirical results suggest that it is possible to exploit long-distance label dependency within the original first-order linear-chain CRF structure for NER, at a lower computational cost than with the second-order CRF.

One of the primary advantages of applying the CRF to language processing is that it learns transition factors between hidden variables corresponding to the label of a single word. The fundamental assumption of the model is that the current hidden state is conditioned on the present observation as well as the previous state. For example, a part-of-speech (POS) tag depends on the word itself, as well as on the POS tag transition from the previous word. In that problem, the POS tags are adjacent to each other in a text, forming a tag sequence; therefore, the sequence labeling model can fully capture dependencies between labels.
In contrast, a CRF in named entity recognition (NER) cannot fully capture dependencies between named entity (NE) labels. According to Ratinov & Roth (2009), named entities in a text are separated by successive "outside tokens" (i.e., non-named-entity words that syntactically link two NEs), and a considerable number of NEs tend to exist at a distance from each other. Therefore, high-order interdependencies between named entities separated by successive outside tokens are not captured by first-order or second-order transition factors.
One major issue in previous studies concerned how to explore long-distance dependencies in NER. Only dependencies between neighboring labels are generally used in practice because conventional high-order CRFs are known to be intractable in NER (Ye, Lee, Chieu, & Wu, 2009). Previous studies have demonstrated that implementing a higher-order CRF that exploits pre-defined label patterns leads to a slight performance improvement over the conventional CRF in NER (Cuong, Ye, Lee, & Chieu, 2014; Fersini, Messina, Felici, & Roth, 2014; Sarawagi & Cohen, 2005; Ye et al., 2009). However, these approaches have drawbacks in handling named entity transitions across outside-token spans of arbitrary length.
In an attempt to utilize long-distance transition information between NEs across non-named-entity tokens, this study explores a method that modifies the first-order linear-chain CRF by using an induction method.

Precursor-induced CRF
Prior to introducing the new model formulation, this section presents the general concept of the CRF. As a sequence labeling model, the conventional CRF models the conditional distribution p(y|x), in which x is the input (e.g., token, word) sequence and y is the label sequence of x. A hidden state value set consists of target entity labels and a single outside label. By way of illustration, presume a set {A, B, O} as the hidden state value set; assign A or B to NEs and, likewise, assign O to outside words. From the hidden state set, a label sequence is formed in a linear chain in NER; for example, a sequence 〈A, O, ⋯, O, B〉 in which successive outside words lie between the two NE words. Because the first-order model assumes that state transition dependencies exist only between two adjacent labels, in order to prevent an increase in computational complexity, the first-order CRF learns only bigram label transitions from the sequence; that is, {(A, O), (O, O), (O, B)} is the label transition data learnt from the example sequence. In the example, the dependency (A, B) is not captured by the model. The main purpose of the precursor-induced CRF model introduced in this study is to capture a specific high-order named entity dependency: the one between two NEs separated by an outside word sequence. The main idea can be explained in the following manner:
• It mainly focuses on the beneficial use of the outside label as a medium delivering dependency between separated NEs (Figure 1(a)).
• It adds a memory element to the hidden variables for the outside states (Figure 1(b)).
• The first outside label in an outside subsequence explicitly has a first-order dependency with its adjacent entity. If the first outside label passes this information to the next outside label, the information can flow forward.
• By this induction process, the information of the first entity can flow through multiple outside labels to the second entity state (Figure 1(c)).
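The limitation of bigram transitions described above can be illustrated with a minimal sketch. The label names A, B (entities) and O (outside) and the helper name `bigram_transitions` are illustrative assumptions, not part of the authors' implementation.

```python
def bigram_transitions(labels):
    """Return the set of adjacent (previous, current) label pairs."""
    return set(zip(labels, labels[1:]))


# Example sequence: entity A, three outside tokens, then entity B.
sequence = ["A", "O", "O", "O", "B"]
transitions = bigram_transitions(sequence)

print(sorted(transitions))         # the only pairs a first-order model observes
print(("A", "B") in transitions)   # the entity-to-entity pair is never adjacent
```

Because A and B are never adjacent, no first-order transition feature can fire on the pair, however many such sequences appear in the training data.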
In the pre-induced CRF, the outside state with a memory element behaves as an information transmission medium that carries forward information about the presence or absence of a preceding entity. This requires expanding the state set. First, the states are collected and only the entity states are selected. Next, a multiplied outside state set is derived by combining each entity state with the outside state. Finally, the expanded state set is derived as the union of the entity states and the multiplied outside states.
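The state-set expansion described above might be sketched as follows. The `O[entity]` naming and the function name `expand_states` are hypothetical, and keeping a plain `O` for the case where no entity precedes is an assumption made here to represent the "absence" case.

```python
def expand_states(states, outside="O"):
    """Build an expanded state list from an original label set.

    Assumption: the plain outside label is kept for outside tokens that
    have no preceding entity; "O[e]" marks outside tokens preceded by e.
    """
    # Step 1: collect the states and keep only the entity states.
    entity_states = [s for s in states if s != outside]
    # Step 2: combine each entity state with the outside state.
    multiplied_outside = [f"{outside}[{e}]" for e in entity_states]
    # Step 3: union of entity states and (multiplied) outside states.
    return entity_states + [outside] + multiplied_outside


print(expand_states(["A", "B", "O"]))
```

A list is returned instead of a set only to keep the ordering deterministic for inspection.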
Turning to the formulation, the conditional probability distribution of a label sequence y given an observation x in the CRF takes the form of Eq. (1), where f_k is an arbitrary feature function with corresponding weight θ_k, Z(x) is a partition function, and t is the time step (Sutton & McCallum, 2011). The feature function f_k is generally an indicator function that has the value 1 only if a certain condition is matched, and 0 otherwise. A transition factor in the CRF has the form f_ij(y, y', x) = 1{y=i}1{y'=j}, and an observation factor has the form f_io(y, y', x) = 1{y=i}1{x=o}. Derived from Eq. (1), the conditional probability distribution of the precursor-induced CRF takes the form of Eq. (2), where the variable a stores the induced state information, and the value of a_t is activated by the values of a_{t-1} and y_t. Once a_t is activated, it transmutes the value of y_t. This induction process expands the original label value set: it produces newly induced outside states in place of the single outside state. For example, the process modifies an original label sequence 〈A, O, ⋯, O, B〉 (entities A and B separated by outside labels O) to 〈A, O[A], ⋯, O[A], B〉. This transformation helps the CRF learn long-distance named entity transitions even in the first-order form; from the modified example sequence, the model can learn the label transition {(O[A], B)}, in which entity B depends on the entity A preceding it. In terms of the number of newly produced states, when N=|States| in the original first-order CRF (a state set consists of NE states and one outside state), this procedure introduces N new states (if the IOB2 tagging scheme (Tjong & Sang, 1995) is applied, (N − 1)/2 + 1 new states are introduced).
To train the precursor-induced CRF, the L-BFGS optimization method (Sha & Pereira, 2003) and L2 regularization (Ng, 2004) are used, as in the conventional first-order CRF (Sutton & McCallum, 2011). Furthermore, the Viterbi algorithm is used for inference.
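The induction step on a label sequence can be sketched as a single left-to-right pass. This is a minimal illustration under stated assumptions, not the authors' code: the variable `memory` plays the role of a_t, the `O[...]` notation is hypothetical, and stripping the IOB2 prefix so that only the entity type is remembered is an assumption chosen to match the smaller state count reported for IOB2.

```python
def entity_type(label):
    """Strip an IOB2 prefix ("B-" or "I-") if present; otherwise return the label."""
    return label[2:] if label[:2] in ("B-", "I-") else label


def induce_labels(labels, outside="O"):
    """Rewrite outside labels so they carry the most recent preceding entity."""
    induced = []
    memory = None  # a_t: type of the last entity seen in this sentence, if any
    for label in labels:
        if label == outside:
            # The outside label is transmuted by the memory element.
            induced.append(outside if memory is None else f"{outside}[{memory}]")
        else:
            # An entity label activates (overwrites) the memory element.
            memory = entity_type(label)
            induced.append(label)
    return induced


print(induce_labels(["A", "O", "O", "B"]))
print(induce_labels(["O", "B-problem", "I-problem", "O", "O", "B-test"]))
```

After this rewrite, the adjacent pair (`O[A]`, `B`) is an ordinary first-order transition, so a standard first-order trainer can be applied to the induced sequences unchanged.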
During training and inference, it is also necessary in practice to treat the fragmented outside states as a single outside label, for two reasons. First, the weight of an observation feature f_io depends on the frequency of an observation as well as on its co-occurring label data. Fragmenting a single outside state into multiple states may cause data-sparseness problems during training, especially for observation features occurring within the fine-grained outside states. To prevent the data-sparseness problem introduced by the precursor-induced CRF, the observation factor f_io(y, y', x) is customized as (1{i∉Outside, y=i} + 1{i∈Outside}) 1{x=o}1{y'=1}. Second, the label alphabet predicted at inference time must match the label alphabet of the given annotation. Therefore, each fragmented outside state is reverted to the original outside label.
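Both uses of the single outside label described above can be sketched with one collapsing helper. The `O[...]` naming convention and the function names are illustrative assumptions; the sketch shows only the weight-sharing idea, not the exact customized factor.

```python
def collapse_outside(label, outside="O"):
    """Map a fragmented outside state such as "O[A]" back to plain "O"."""
    if label == outside or label.startswith(outside + "["):
        return outside
    return label


def observation_feature_key(label, token):
    """Key under which an observation weight is stored.

    All outside fragments share one key per token, so observation
    statistics are not split across the fine-grained outside states.
    """
    return (collapse_outside(label), token)


def revert_sequence(labels):
    """Restore the original label alphabet after decoding."""
    return [collapse_outside(label) for label in labels]


print(observation_feature_key("O[A]", "the") == observation_feature_key("O[B]", "the"))
print(revert_sequence(["A", "O[A]", "O[A]", "B"]))
```

Transition features still see the fragmented states; only observation features and the final output use the collapsed label.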

Experiments
All the experiments were performed by implementing both the original and the precursor-induced CRF (https://github.com/jinsamdol/precursor-induced_CRF). The implementation is based on the CRF implemented in MALLET (McCallum, 2002). To compare the precursor-induced CRF with the original CRF in NER on real-world clinical documents and biomedical literature, three annotated NER corpora were used: the i2b2 2012 NLP shared task data (Sun, Rumshisky, & Uzuner, 2013), discharge summaries of rheumatism patients at Seoul National University Hospital (SNUH), and the JNLPBA 2004 Bio-Entity Recognition shared task data (Kim, Ohta, Tsuruoka, Tateisi, & Collier, 2004). The SNUH discharge summary corpus was built for this evaluation. It consists of 200 electronic clinical documents in which English and Korean words are jointly used for recording patient history. We used the training and test splits provided with the i2b2 2012 and JNLPBA corpora. For the SNUH corpus, 10-fold cross-validation was used.
Annotated named entities involved in the clinical NER evaluation are related to mentions describing the patient's history. In the i2b2 2012 corpus, problem, test, and treatment named entity classes are used. In the SNUH corpus, symptom, test, diagnosis, medication, and procedure-operation classes are used. The named entity classes in the biomedical NER evaluation are DNA, RNA, protein, cell line, and cell type.
In the i2b2 2012 training data, 9,942 entities are preceded by an outside state, and approximately 63.8% of them follow the pattern 〈A, O+, B〉 (an entity, one or more outside labels, and another entity). Likewise, in the SNUH corpus, 58.9% of the NEs preceded by an outside state have a preceding named entity. The median distance between consecutive entities tends to be within 3-4 in the datasets. The long-distance dependency is restricted to a single instance (i.e., a sentence).
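Corpus statistics of this kind could be computed along the following lines. This is a sketch under assumptions not stated in the text: labels are taken to be IOB2 tags, an entity is counted at its "B-" tag, and the distance between consecutive entities is taken to be the number of outside labels between them. The function name `entity_gap_stats` is hypothetical.

```python
from statistics import median


def entity_gap_stats(sentences, outside="O"):
    """Return (entities preceded by outside, share with an earlier entity, median gap)."""
    gaps = []           # lengths of outside runs lying between two entities
    after_outside = 0   # entities whose immediately preceding label is outside
    for labels in sentences:      # dependencies never cross sentence boundaries
        seen_entity = False
        run = 0                   # length of the current run of outside labels
        for label in labels:
            if label == outside:
                run += 1
                continue
            if label.startswith("B-") and run > 0:
                after_outside += 1
                if seen_entity:
                    gaps.append(run)
            seen_entity = True
            run = 0
    share = len(gaps) / after_outside if after_outside else 0.0
    return after_outside, share, (median(gaps) if gaps else None)


toy = [
    ["O", "B-test", "O", "O", "B-problem", "I-problem"],
    ["B-treatment", "O", "B-problem"],
]
print(entity_gap_stats(toy))
```

The toy input is made up for illustration and does not reproduce any figure reported for the corpora.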
To perform the NER evaluation, two feature families are used: (a) the token itself and its neighboring tokens within a window of size 3, together with morphologically normalized tokens; and (b) morphological features such as character prefixes and suffixes of length 2-4. Feature setting 1 uses only feature family (a), and feature setting 2 uses both feature families (a) and (b). These simple feature configurations were chosen to reduce the bias that the features would otherwise introduce into the performance comparison of the models.
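The two feature settings might be sketched as follows. Several details are assumptions, since the text does not specify them: the window of size 3 is read as the previous, current, and next token; lowercasing stands in for morphological normalization; affixes are taken from the current token only; and the feature-string format and the name `token_features` are hypothetical.

```python
def token_features(tokens, i, use_affixes=False):
    """Features for tokens[i]; use_affixes=False is setting 1, True is setting 2."""
    feats = []
    # Family (a): the token and its neighbors, raw and normalized.
    for offset in (-1, 0, 1):
        j = i + offset
        tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        feats.append(f"w[{offset}]={tok}")
        feats.append(f"norm[{offset}]={tok.lower()}")  # stand-in normalization
    # Family (b): character prefixes and suffixes of length 2-4.
    if use_affixes:
        word = tokens[i]
        for n in (2, 3, 4):
            if len(word) >= n:
                feats.append(f"prefix{n}={word[:n]}")
                feats.append(f"suffix{n}={word[-n:]}")
    return feats


sentence = ["Patient", "denies", "fever"]
print(token_features(sentence, 1))                    # feature setting 1
print(token_features(sentence, 1, use_affixes=True))  # feature setting 2
```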
In order to compare the proposed model with the conventional CRF, both the first-order and the second-order CRF are used as baseline models.
The performance comparison is shown in Table 1. The results show that the precursor-induced (pre-induced) CRF tends to yield a slight performance improvement over both the first-order and second-order CRFs in most cases. However, the overall improvement is small. Table 2 compares the elapsed time per iteration in parameter training for each model. The second-order CRF takes considerably more time than the first-order CRF to compute one training iteration. The pre-induced CRF takes 1.7 times more computation time than the first-order CRF on average. The pre-induced CRF takes significantly less time than the second-order CRF, even though it exploits longer label transition dependencies than the second-order CRF.
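A rough back-of-the-envelope sketch can make the ordering of these costs plausible. It is not taken from the paper's measurements: it assumes dense transition tables (state pairs for first-order models, state triples for the second-order model) and counts one outside fragment per entity type plus one for "no preceding entity" under IOB2. The function name `transition_counts` is hypothetical.

```python
def transition_counts(num_entity_types, iob2=True):
    """Count label combinations scored per position under a dense-table assumption."""
    entity_states = 2 * num_entity_types if iob2 else num_entity_types
    n = entity_states + 1  # original states: entity states plus one outside state
    # Outside fragments: one per remembered entity (type), plus "none".
    outside_fragments = (num_entity_types if iob2 else entity_states) + 1
    induced = entity_states + outside_fragments
    return {
        "first_order": n ** 2,        # pairs (y_{t-1}, y_t)
        "pre_induced": induced ** 2,  # pairs over the expanded state set
        "second_order": n ** 3,       # triples (y_{t-2}, y_{t-1}, y_t)
    }


# Three entity classes with IOB2 tags, as an illustrative configuration.
print(transition_counts(3))
```

The dense count is an upper bound for the pre-induced model, since many pairs of fragmented outside states can never follow each other, so actual running time depends on the implementation.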
These results indicate that the precursor-induced CRF, in which long-distance dependency is introduced into the CRF by label induction, slightly improves effectiveness in clinical and biomedical NER at a significantly lower computational cost than building second- or higher-order CRFs.

Conclusion
The need to utilize high-order dependencies often arises in sequence labeling problems; however, second-order or higher-order models are considered computationally infeasible. Therefore, this study focuses on the beneficial use of the single outside label as a medium delivering long-distance dependency. The design of the precursor-induced CRF allows preceding named entity information to pass through outside labels by induction, even when the model maintains a first-order template. Although the performance improvement is small in both the clinical and biomedical NER evaluations, this study has shown that the proposed design reduces the computational cost of utilizing long-distance label dependency compared to the second-order CRF.
Evidence from this study suggests that using outside labels as a transmission medium for preceding NE information can enhance the expressiveness of the CRF while keeping the first-order template. Considerable work is still required to validate the model. For example, validating the precursor-induced CRF within a deep neural architecture for NER, such as the LSTM-CRF (Lample et al., 2016), will be worth performing in the future. In addition, validation of the model on other problems, such as NER in the general domain (Tjong, Sang, & Meulder, 2003) and de-identification of personal health information in clinical natural language processing (Stubbs, Filannino, & Uzuner, 2017; Stubbs, Kotfila, & Uzuner, 2015), will be performed in future work.