Hybrid semi-Markov CRF for Neural Sequence Labeling

This paper proposes hybrid semi-Markov conditional random fields (SCRFs) for neural sequence labeling in natural language processing. Based on conventional conditional random fields (CRFs), SCRFs have been designed for the tasks of assigning labels to segments by extracting features from and describing transitions between segments instead of words. In this paper, we improve the existing SCRF methods by employing word-level and segment-level information simultaneously. First, word-level labels are utilized to derive the segment scores in SCRFs. Second, a CRF output layer and an SCRF output layer are integrated into a unified neural network and trained jointly. Experimental results on the CoNLL 2003 named entity recognition (NER) shared task show that our model achieves state-of-the-art performance when no external knowledge is used.


Introduction
Sequence labeling, such as part-of-speech (POS) tagging, chunking, and named entity recognition (NER), is a category of fundamental tasks in natural language processing (NLP). Conditional random fields (CRFs) (Lafferty et al., 2001), as probabilistic undirected graphical models, have been widely applied to sequence labeling tasks because they are able to describe the dependencies between adjacent word-level labels and to avoid illegal label combinations (e.g., I-ORG can't follow B-LOC in the NER tasks using the BIOES tagging scheme). Original CRFs utilize hand-crafted features, which increases the difficulty of performance tuning and domain adaptation. In recent years, neural networks with distributed word representations (i.e., word embeddings) (Mikolov et al., 2013; Pennington et al., 2014) have been introduced to calculate word scores automatically for CRFs (Chiu and Nichols, 2016; Huang et al., 2015).
On the other hand, semi-Markov conditional random fields (SCRFs) (Sarawagi and Cohen, 2005) have been proposed for the tasks of assigning labels to the segments of input sequences, e.g., NER. Different from CRFs, SCRFs adopt segments instead of words as the basic units for feature extraction and transition modeling. The word-level transitions within a segment are usually ignored. Some variations of SCRFs have also been studied. For example, Andrew (2006) extracted segment-level features by combining hand-crafted CRF features and modeled the Markov property between words instead of segments in SCRFs.
With the development of deep learning, some models combining neural networks and SCRFs have also been studied. Zhuo et al. (2016) and Kong et al. (2015) employed gated recursive convolutional neural networks (grConvs) and segmental recurrent neural networks (SRNNs), respectively, to calculate segment scores for SCRFs.
All these existing neural sequence labeling methods using SCRFs only adopted segment-level labels for score calculation and model training. In this paper, we suppose that word-level labels can also contribute to the building of SCRFs and thus design a hybrid SCRF (HSCRF) architecture for neural sequence labeling. In an HSCRF, word-level labels are utilized to derive the segment scores. Further, a CRF output layer and an HSCRF output layer are integrated into a unified neural network and trained jointly. We evaluate our model on the CoNLL 2003 English NER task (Sang and Meulder, 2003) and achieve state-of-the-art performance when no external knowledge is used.
Figure 1: The diagram of a neural network with an HSCRF output layer for sequence labeling.
In summary, the contributions of this paper are: (1) we propose the HSCRF architecture, which employs both word-level and segment-level labels for segment score calculation; (2) we propose a joint CRF-HSCRF training framework and a naive joint decoding algorithm for neural sequence labeling; (3) we achieve state-of-the-art performance in the CoNLL 2003 NER shared task.

Hybrid semi-Markov CRFs
Let $s = \{s_1, s_2, ..., s_p\}$ denote the segmentation of an input sentence $x = \{x_1, ..., x_n\}$ and $w = \{w_1, ..., w_n\}$ denote the sequence of word representations of $x$ derived by a neural network as shown in Fig. 1. Each segment $s_i = (b_i, e_i, l_i)$, $0 \le i \le p$, is a triplet of a begin word index $b_i$, an end word index $e_i$ and a segment-level label $l_i$, where $b_1 = 1$, $e_p = |x|$, $b_{i+1} = e_i + 1$, $0 \le e_i - b_i < L$, and $L$ is the upper bound of the length of $s_i$. Correspondingly, let $y = \{y_1, ..., y_n\}$ denote the word-level labels of $x$. For example, if a sentence $x$ in the NER task is "Barack Hussein Obama and Natasha Obama", we have the corresponding $s$ = ((1, 3, PER), (4, 4, O), (5, 6, PER)) and $y$ = (B-PER, I-PER, E-PER, O, B-PER, E-PER).
Similar to conventional SCRFs (Sarawagi and Cohen, 2005), the probability of a segmentation $\hat{s}$ in an HSCRF is defined as
$$p(\hat{s} \mid w) = \frac{\mathrm{score}(\hat{s}, w)}{\sum_{s' \in S} \mathrm{score}(s', w)}, \qquad \mathrm{score}(s, w) = \prod_{i=1}^{|s|} \exp\{ m_i + b_{l_{i-1}, l_i} \},$$
where $S$ contains all possible segmentations, $m_i = \varphi_h(l_i, w, b_i, e_i)$ is the segment score, and $b_{i,j}$ is the segment-level transition parameter from class $i$ to class $j$.
Different from existing methods of utilizing SCRFs in neural sequence labeling (Zhuo et al., 2016; Kong et al., 2015), the segment score in an HSCRF is calculated using word-level labels as
$$m_i = \sum_{k=b_i}^{e_i} \varphi_c(y_k, w'_k) = \sum_{k=b_i}^{e_i} a_{y_k}^{\top} w'_k,$$
where $w'_k$ is the feature vector of the $k$-th word, $\varphi_c(y_k, w'_k)$ calculates the score of the $k$-th word being classified into word-level class $y_k$, and $a_{y_k}$ is a weight parameter vector corresponding to class $y_k$. For each word, $w'_k$ is composed of the word representation $w_k$ and another two segment-level descriptions, i.e., (1) $w_{e_i} - w_{b_i}$, which is derived based on the assumption that word representations in the same segment (e.g., "Barack Obama") are closer to each other than otherwise (e.g., "Obama is"), and (2) [...]. These components are joined by a vector concatenation operation $[\,\cdot\,;\,\cdot\,]$.
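The relation between segment-level and word-level labels, and the way a segment score is accumulated from word-level scores, can be illustrated with a minimal sketch. The function names `segments_to_bioes` and `segment_score` are hypothetical, plain Python lists stand in for tensors, and the feature vector here uses only the word representation and the first segment-level description (the difference between the end and begin representations); it is an illustration under these assumptions, not the authors' implementation.

```python
from typing import Dict, List, Sequence, Tuple

# (begin, end, label) with 1-based inclusive word indices, as in the text.
Segment = Tuple[int, int, str]


def segments_to_bioes(segments: Sequence[Segment]) -> List[str]:
    """Expand segment-level labels into word-level BIOES labels."""
    tags: List[str] = []
    for begin, end, label in segments:
        length = end - begin + 1
        if label == "O":
            tags.extend(["O"] * length)
        elif length == 1:
            tags.append(f"S-{label}")
        else:
            tags.append(f"B-{label}")
            tags.extend([f"I-{label}"] * (length - 2))
            tags.append(f"E-{label}")
    return tags


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def segment_score(
    segment: Segment,
    word_reprs: Sequence[Sequence[float]],
    class_weights: Dict[str, Sequence[float]],
) -> float:
    """Sum over the words of a segment of a_{y_k} . w'_k.

    Assumption: w'_k = [w_k ; w_e - w_b], i.e. only the first of the two
    segment-level descriptions is included in this sketch.
    """
    begin, end, _ = segment
    w_b, w_e = word_reprs[begin - 1], word_reprs[end - 1]
    diff = [e - b for e, b in zip(w_e, w_b)]
    total = 0.0
    for offset, tag in enumerate(segments_to_bioes([segment])):
        feature = list(word_reprs[begin - 1 + offset]) + diff
        total += dot(class_weights[tag], feature)
    return total


# The example sentence from the text.
example = [(1, 3, "PER"), (4, 4, "O"), (5, 6, "PER")]
print(segments_to_bioes(example))
```

Because the score of a segment depends on the word-level tags implied by its label and boundaries, the same word representation contributes differently when it begins, continues, or ends an entity.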
The training and decoding criteria of conventional SCRFs (Sarawagi and Cohen, 2005) are followed. The negative log-likelihood (NLL), i.e., $-\log p(\hat{s} \mid w)$, is minimized to estimate the parameters of the HSCRF layer and the lower neural network layers that derive word representations. For decoding, the Viterbi algorithm is employed to obtain the optimal segmentation as
$$s^{*} = \arg\max_{s \in S} \log p(s \mid w),$$
where $S$ contains all legitimate segmentations.
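The training and decoding criteria above both reduce to dynamic programs over segment end positions. The following is a minimal sketch, assuming a callable `seg_score(b, e, label)` that stands in for the segment score, a nested dictionary `trans` for the segment-level transition parameters, and no transition term for the first segment; all of these names and the toy scores are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[int, int, str]
SegScore = Callable[[int, int, str], float]
Trans = Dict[str, Dict[str, float]]


def _logsumexp(values: List[float]) -> float:
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_partition(n: int, labels: Sequence[str], max_len: int,
                  seg_score: SegScore, trans: Trans) -> float:
    """Log of the summed scores of all segmentations of words 1..n."""
    alpha: List[Dict[str, float]] = [dict() for _ in range(n + 1)]
    for j in range(1, n + 1):
        for lab in labels:
            terms: List[float] = []
            for d in range(1, min(max_len, j) + 1):
                b = j - d + 1
                m = seg_score(b, j, lab)
                if b == 1:  # first segment: no transition term (assumption)
                    terms.append(m)
                else:
                    terms.extend(alpha[b - 1][p] + trans[p][lab] + m for p in labels)
            alpha[j][lab] = _logsumexp(terms)
    return _logsumexp(list(alpha[n].values()))


def viterbi(n: int, labels: Sequence[str], max_len: int,
            seg_score: SegScore, trans: Trans) -> Tuple[float, List[Segment]]:
    """Highest-scoring segmentation and its unnormalized log-score."""
    best: List[Dict[str, float]] = [dict() for _ in range(n + 1)]
    back: List[Dict[str, tuple]] = [dict() for _ in range(n + 1)]
    for j in range(1, n + 1):
        for lab in labels:
            best[j][lab] = float("-inf")
            for d in range(1, min(max_len, j) + 1):
                b = j - d + 1
                m = seg_score(b, j, lab)
                if b == 1:
                    candidates = [(m, None)]
                else:
                    candidates = [(best[b - 1][p] + trans[p][lab] + m, p) for p in labels]
                for cand, prev in candidates:
                    if cand > best[j][lab]:
                        best[j][lab], back[j][lab] = cand, (b, prev)
    lab = max(labels, key=lambda name: best[n][name])
    score, j, segments = best[n][lab], n, []
    while j > 0:  # follow back-pointers from the last segment to the first
        b, prev = back[j][lab]
        segments.append((b, j, lab))
        j, lab = b - 1, prev
    return score, segments[::-1]


# Toy two-word sentence with hypothetical scores.
LABELS = ["O", "PER"]
TOY = {(1, 1, "O"): 0.0, (1, 1, "PER"): 1.0, (2, 2, "O"): 0.5,
       (2, 2, "PER"): 0.0, (1, 2, "O"): -1.0, (1, 2, "PER"): 3.0}
ZERO_TRANS = {a: {b: 0.0 for b in LABELS} for a in LABELS}
toy_score = lambda b, e, lab: TOY[(b, e, lab)]

best_score, best_seg = viterbi(2, LABELS, 2, toy_score, ZERO_TRANS)
nll = log_partition(2, LABELS, 2, toy_score, ZERO_TRANS) - best_score
print(best_seg, round(nll, 4))
```

The NLL of a segmentation is the log partition value minus that segmentation's log-score, so the same forward recursion serves training while the max-based recursion serves decoding.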

Jointly training and decoding using CRFs and HSCRFs
To further investigate the effects of word-level labels on the training of SCRFs, we integrate a CRF output layer and an HSCRF output layer into a unified neural network and train them jointly. These two output layers share the same sequence of word representations $w$, which are extracted by lower neural network layers. Given both word-level and segment-level ground truth labels of training sentences, the model parameters are optimized by minimizing the summation of the loss functions of the CRF layer and the HSCRF layer with equal weights. At decoding time, two label sequences for an input sentence, i.e., $s_c$ and $s_h$, can be obtained using the CRF output layer and the HSCRF output layer respectively. A naive joint decoding algorithm is also designed to make a selection between them. Assume the NLLs of measuring $s_c$ and $s_h$ using the CRF and HSCRF layers are $\mathrm{NLL}_c$ and $\mathrm{NLL}_h$ respectively. Then, we exchange the models, measure the NLLs of $s_c$ and $s_h$ by the HSCRF and CRF layers, and obtain another two values, $\mathrm{NLL}_{c\,\mathrm{by}\,h}$ and $\mathrm{NLL}_{h\,\mathrm{by}\,c}$. We simply assign the summation of $\mathrm{NLL}_c$ and $\mathrm{NLL}_{c\,\mathrm{by}\,h}$ to $s_c$, and the summation of $\mathrm{NLL}_h$ and $\mathrm{NLL}_{h\,\mathrm{by}\,c}$ to $s_h$. Finally, we choose the one between $s_c$ and $s_h$ with the lower NLL sum as the final result.
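The joint objective and the selection rule described above can be sketched as follows. The callables `crf_nll` and `hscrf_nll` are hypothetical stand-ins for the two output layers, both label sequences are assumed to be expressed as word-level BIOES tags so that either layer can score either one, and the tie-breaking choice (preferring the HSCRF output) is an assumption, since the text does not specify it.

```python
from typing import Callable, Tuple

Tags = Tuple[str, ...]  # word-level BIOES tags (assumed common representation)
Scorer = Callable[[Tags], float]  # returns the NLL of a label sequence


def joint_loss(crf_loss: float, hscrf_loss: float) -> float:
    """Training objective: equal-weight sum of the two layers' losses."""
    return crf_loss + hscrf_loss


def naive_joint_decode(s_c: Tags, s_h: Tags,
                       crf_nll: Scorer, hscrf_nll: Scorer) -> Tags:
    """Pick the candidate whose NLL summed over both layers is lower."""
    sum_c = crf_nll(s_c) + hscrf_nll(s_c)  # NLL_c + NLL_{c by h}
    sum_h = hscrf_nll(s_h) + crf_nll(s_h)  # NLL_h + NLL_{h by c}
    return s_c if sum_c < sum_h else s_h   # ties go to s_h (assumption)


# Toy candidates and hypothetical NLL values, for illustration only.
cand_c: Tags = ("B-PER", "E-PER", "O")
cand_h: Tags = ("S-PER", "O", "O")
toy_crf = {cand_c: 1.2, cand_h: 2.0}
toy_hscrf = {cand_c: 1.5, cand_h: 0.4}

chosen = naive_joint_decode(cand_c, cand_h, toy_crf.__getitem__, toy_hscrf.__getitem__)
print(chosen)
```

When both layers produce the same sequence, the two sums coincide and the rule returns that sequence regardless of the tie-breaking choice.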

Dataset
We evaluated our model on the CoNLL 2003 English NER dataset (Sang and Meulder, 2003). This dataset contains four labels of named entities (PER, LOC, ORG and MISC) and the label O for others. The existing separation of training, development and test sets was followed in our experiments. We adopted the same word-level tagging scheme as the one used in Liu et al. (2018) (e.g., BIOES instead of BIO). For better computational efficiency, the maximum segment length L introduced in Section 2.1 was set to 6, which pruned fewer than 0.5% of the training sentences for building SCRFs and had no effect on the development and test sets.
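The effect of the maximum segment length can be checked by recovering segments from BIOES tags and testing the constraint 0 ≤ e − b < L. The sketch below assumes well-formed BIOES input, treats each O word as its own length-1 segment, and interprets pruning as dropping a training sentence that contains an over-long segment; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[int, int, str]  # (begin, end, label), 1-based inclusive
MAX_SEGMENT_LENGTH = 6          # the value of L reported in the text


def bioes_to_segments(tags: Sequence[str]) -> List[Segment]:
    """Recover segments from well-formed word-level BIOES tags."""
    segments: List[Segment] = []
    start = 0
    for idx, tag in enumerate(tags, start=1):
        if tag == "O":
            segments.append((idx, idx, "O"))  # one segment per O word (assumption)
            continue
        prefix, label = tag.split("-", 1)
        if prefix in ("B", "S"):
            start = idx
        if prefix in ("E", "S"):
            segments.append((start, idx, label))
    return segments


def within_max_length(tags: Sequence[str], max_len: int = MAX_SEGMENT_LENGTH) -> bool:
    """True if every segment satisfies 0 <= end - begin < max_len."""
    return all(end - begin < max_len for begin, end, _ in bioes_to_segments(tags))


sentence = ["B-PER", "I-PER", "E-PER", "O", "B-PER", "E-PER"]
print(bioes_to_segments(sentence), within_max_length(sentence))
```

A training set could then be filtered with a comprehension such as `[t for t in tagged_sentences if within_max_length(t)]`, where `tagged_sentences` is a hypothetical list of tag sequences.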

Implementation
As shown in Fig. 1, the GloVe (Pennington et al., 2014) word embedding and the character encoding vector of each word in the input sentence were concatenated and fed into a bi-directional LSTM to obtain the sequence of word representations w. Two character encoding models, LM-BLSTM (Liu et al., 2018) and CNN-BLSTM (Ma and Hovy, 2016), were adopted in our experiments. Regarding the top classification layer, we compared our proposed HSCRF with the conventional word-level CRF and grSemi-CRF (GSCRF) (Zhuo et al., 2016), which was an SCRF using only segment-level information. The descriptions of the models built in our experiments are summarized in Table 1. For a fair comparison, we implemented all models in the same framework using the PyTorch library. The hyper-parameters of the models are shown in Table 2; they were selected according to the two baseline methods without fine-tuning. Each model in Table 1 was estimated 10 times, and the mean and standard deviation of its F1 score were reported, considering the influence of randomness and the weak correlation between development set and test set in this task (Reimers and Gurevych, 2017).
Table 1 lists the F1 score results of all built models on the CoNLL 2003 NER task. Comparing model 3 with model 1/2 and model 9 with model 7/8, we can see that HSCRF performed better than CRF and GSCRF. The differences were significant, since the p-values of the t-tests were smaller than 0.01. This implies the benefits of utilizing word-level labels when deriving segment scores in SCRFs. Comparing model 1 with model 4, 3 with 5, 7 with 10, and 9 with 11, we can see that the joint training method introduced in Section 2.2 improved the performance of CRF and HSCRF significantly (p < 0.01 in all these four pairs). This may be because joint training generates better word representations that can be shared by both the CRF and HSCRF decoding layers.
Finally, comparing model 6 with model 4/5 and model 12 with model 10/11, we can see the effectiveness of the joint decoding algorithm introduced in Section 2.2 in improving F1 scores (p < 0.01 in all these four pairs). The LM-BLSTM-JNT model with joint decoding achieved the highest F1 score among all the built models. Table 3 shows some recent results on the CoNLL 2003 English NER task. For the convenience of comparison, we also list the maximum F1 scores among 10 repetitions when building our models. The maximum F1 score of our re-implemented CNN-BLSTM-CRF model was slightly worse than the one originally reported in Ma and Hovy (2016), but it was similar to the one reported in Reimers and Gurevych (2017).

Comparison with existing work
Among the NER models listed in Table 3, Zhuo et al. (2016) employed some manual features and calculated segment scores by grConv for SCRF. Lample et al. (2016) and Ma and Hovy (2016) constructed character-level encodings using BLSTM and CNN respectively, and concatenated them with word embeddings. Then, the same BLSTM-CRF architecture was adopted in both models. Rei (2017) fed word embeddings into LSTM to obtain the word representations for CRF decoding and to predict the next word simultaneously. Similarly, Liu et al. (2018) fed characters into LSTM to predict the next character and to get the character-level encoding for each word.
Some of the models listed in Table 3 utilized external knowledge besides the CoNLL 2003 training set and pre-trained word embeddings. Luo et al. (2015) proposed the JERL model, which was trained on both NER and entity linking tasks simultaneously. Chiu and Nichols (2016) employed lexicon features from DBpedia (Auer et al., 2007). Tran et al. (2017) and Peters et al. (2017) utilized language models pre-trained on large corpora to model word representations. Yang et al. (2017) utilized transfer learning to obtain shared information from other tasks, such as chunking and POS tagging, for word representations. From Table 3, we can see that our CNN-BLSTM-JNT and LM-BLSTM-JNT models with joint decoding both achieved state-of-the-art F1 scores among all models without using external knowledge. The maximum F1 score achieved by the LM-BLSTM-JNT model was 91.53%.

Analysis
To better understand the effectiveness of word-level and segment-level labels on the NER task, we evaluated the performance of models 7, 8, 9 and 12 in Table 3 for entities with different lengths. The mean F1 scores of 10 training repetitions are reported in Table 4. Comparing model 7 with model 8, we can see that GSCRF achieved better performance than CRF for long entities (with more than 4 words) but worse for short entities (with fewer than 3 words). Comparing model 7 with model 9, we can find that HSCRF outperformed CRF in recognizing long entities while achieving performance comparable to CRF for short entities.
One possible explanation is that word-level labels may supervise models to learn word-level descriptions, which tend to benefit the recognition of short entities. On the other hand, segment-level labels may guide models to capture the descriptions of combining words for whole entities, which helps to recognize long entities. By utilizing both kinds of labels, the LM-BLSTM-HSCRF model can achieve better overall performance in recognizing entities with different lengths. Furthermore, the LM-BLSTM-JNT(JNT) model, which adopted joint training and decoding, achieved the best performance among all models shown in Table 4 for all entity lengths.
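An evaluation broken down by entity length, as in the analysis above, can be sketched with exact-match entity-level F1 restricted to a length range. The function name `entity_f1` and the toy entities are hypothetical, and bucketing gold entities by gold length and predicted entities by predicted length is an assumption of this sketch rather than a protocol stated in the text.

```python
from typing import Callable, Set, Tuple

# (sentence_id, begin, end, label) with inclusive word indices.
Entity = Tuple[int, int, int, str]


def entity_f1(gold: Set[Entity], pred: Set[Entity],
              keep: Callable[[int], bool] = lambda length: True) -> float:
    """Exact-match F1 over entities whose length satisfies `keep`."""
    gold_kept = {e for e in gold if keep(e[2] - e[1] + 1)}
    pred_kept = {e for e in pred if keep(e[2] - e[1] + 1)}
    true_pos = len(gold_kept & pred_kept)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_kept)
    recall = true_pos / len(gold_kept)
    return 2 * precision * recall / (precision + recall)


# Toy annotations, for illustration only.
toy_gold = {(0, 1, 3, "PER"), (0, 5, 6, "PER"), (1, 1, 1, "LOC")}
toy_pred = {(0, 1, 3, "PER"), (0, 5, 5, "PER"), (1, 1, 1, "LOC")}

print(entity_f1(toy_gold, toy_pred, lambda n: n < 3))   # short entities
print(entity_f1(toy_gold, toy_pred, lambda n: n >= 3))  # longer entities
```

An entity counts as correct only when its boundaries and label both match, so a boundary error on a long entity lowers the score in whichever length bucket each span falls into.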

Conclusions
This paper proposes a hybrid semi-Markov conditional random field (HSCRF) architecture for neural sequence labeling, in which word-level labels are utilized to derive the segment scores in SCRFs.
Further, methods for jointly training and decoding the CRF and HSCRF output layers are also presented. Experimental results on the CoNLL 2003 English NER task demonstrated the effectiveness of the proposed HSCRF model, which achieved state-of-the-art performance.