Segment-Level Sequence Modeling using Gated Recursive Semi-Markov Conditional Random Fields

Most sequence tagging tasks in natural language processing require recognizing segments with a certain syntactic role or semantic meaning in a sentence. They are usually tackled with Conditional Random Fields (CRFs), which do indirect word-level modeling over word-level features and thus cannot make full use of segment-level information. Semi-Markov Conditional Random Fields (Semi-CRFs) model segments directly, but extracting segment-level features for Semi-CRFs remains a very challenging problem. This paper presents Gated Recursive Semi-CRFs (grSemi-CRFs), which model segments directly and automatically learn segment-level features through a gated recursive convolutional neural network. Our experiments on text chunking and named entity recognition (NER) demonstrate that grSemi-CRFs generally outperform other neural models.


Introduction
Most sequence tagging tasks in natural language processing (NLP) are segment-level tasks, such as text chunking and named entity recognition (NER), which require recognizing segments (i.e., contiguous sequences of words) with a certain syntactic role or semantic meaning in a sentence. These tasks are usually tackled with Conditional Random Fields (CRFs) (Lafferty et al., 2001), which do word-level modeling by assigning each word a tag according to some predefined tagging scheme, e.g., the "IOB" scheme (Ramshaw and Marcus, 1995). Such tagging schemes are lossy transformations of the original segment tags: they do indicate the boundaries of adjacent segments but lose the length information of segments to some extent. Besides, CRFs can only employ word-level features, which are either hand-crafted or extracted with deep neural networks, such as window-based neural networks (Collobert et al., 2011) and bidirectional Long Short-Term Memory networks (BI-LSTMs). Therefore, CRFs cannot make full use of segment-level information, such as inner properties of segments, which cannot be fully encoded in word-level features.
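For concreteness, a hypothetical helper shows how segment annotations are flattened into one tag per word under an IOB-style scheme (the exact scheme variant differs across papers; this sketch uses the common "B-/I-/O" convention):

```python
def segments_to_iob(segments, length):
    """Convert segments [(start, seg_len, label), ...] (0-indexed)
    into one IOB-style tag per word; unannotated words get "O"."""
    tags = ["O"] * length
    for start, seg_len, label in segments:
        tags[start] = "B-" + label          # segment-initial word
        for i in range(start + 1, start + seg_len):
            tags[i] = "I-" + label          # segment-internal words
    return tags

# A 6-word sentence chunked as [NP][VP][NP NP NP NP] (CoNLL-2000 style):
tags = segments_to_iob([(0, 1, "NP"), (1, 1, "VP"), (2, 4, "NP")], 6)
# tags == ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP"]
```

Note that the per-word tags mark where segments begin, but a model over such tags must reconstruct segment lengths indirectly, which is the lossiness discussed above.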
Semi-Markov Conditional Random Fields (Semi-CRFs) (Sarawagi and Cohen, 2004) were proposed to model segments directly and thus readily utilize segment-level features that encode useful segment information. Existing work has shown that Semi-CRFs outperform CRFs on segment-level tagging tasks such as sequence segmentation (Andrew, 2006), NER (Sarawagi and Cohen, 2004; Okanohara et al., 2006), web data extraction (Zhu et al., 2007) and opinion extraction (Yang and Cardie, 2012). However, Semi-CRFs need many more features than CRFs, as they must model segments of different lengths. As manually designing these features is tedious and often incomplete, how to automatically extract good features becomes a very important problem for Semi-CRFs. A naive solution that builds multiple feature extractors, each of which extracts features for segments of a specific length, is clearly time-consuming. Moreover, some of these separate extractors may underfit, as segments of a specific length may be very rare in the training data. To date, Semi-CRFs lack an automatic segment-level feature extractor.
In this paper, we fill this research void by proposing Gated Recursive Semi-Markov Conditional Random Fields (grSemi-CRFs), which can automatically learn features for segment-level sequence tagging tasks. Unlike previous approaches, which usually use a neural feature extractor with a CRF layer, a grSemi-CRF consists of a gated recursive convolutional neural network (grConv) (Cho et al., 2014) with a Semi-CRF layer. The grConv is a variant of recursive neural networks. It builds a pyramid-like structure to extract segment-level features in a hierarchical way. This feature hierarchy matches the intuition that long segments are combinations of their short sub-segments. The idea was first explored by Cho et al. (2014) to build an encoder for neural machine translation and then extended to other problems, such as sentence-level classification (Zhao et al., 2015) and Chinese word segmentation (Chen et al., 2015). The advantages of grSemi-CRFs are twofold. First, thanks to the pyramid architecture of grConvs, grSemi-CRFs can extract all the segment-level features using one single feature extractor, and there is no underfitting problem, as all parameters of the feature extractor are shared globally. Besides, unlike recurrent neural network (RNN) models, the training and inference of grSemi-CRFs are very fast, as there is no time dependency and all the computations can be done in parallel. Second, thanks to the semi-Markov structure of Semi-CRFs, grSemi-CRFs can model segments in sentences directly without introducing extra tagging schemes, which solves the problem that segment length information cannot be fully encoded in tags. Besides, grSemi-CRFs can also utilize segment-level features, which flexibly encode segment-level information such as inner properties of segments, unlike the word-level features used in CRFs. By combining grConvs with Semi-CRFs, we propose a new way to automatically extract segment-level features for Semi-CRFs.
Our major contributions can be summarized as follows: (1) We propose grSemi-CRFs, which solve both the automatic feature extraction problem of Semi-CRFs and the indirect word-level modeling problem of CRFs. As a result, grSemi-CRFs do segment-level modeling directly and make full use of segment-level features. (2) We evaluate grSemi-CRFs on two segment-level sequence tagging tasks, text chunking and NER. Experimental results show the effectiveness of our model.

Preliminary
In sequence tagging tasks, given a word sequence, the goal is to assign each word (e.g., in POS tagging) or each segment (e.g., in text chunking and NER) a tag. By leveraging a tagging scheme like "IOB", all these tasks can be regarded as word-level tagging. More formally, let X denote the set of words and Y denote the set of tags. A sentence of length T can be denoted by x = (x_1, ..., x_T) and its corresponding tags by y = (y_1, ..., y_T). A CRF (Lafferty et al., 2001) defines a conditional distribution

p(y | x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} F(y_t, x) + A(y_{t-1}, y_t) \Big),   (1)

where F(y_t, x) is the tag score (or potential) for tag y_t at position t, A(y_{t-1}, y_t) is the transition score between y_{t-1} and y_t, which measures the tag dependencies of adjacent words, and Z(x) = \sum_{y} \exp\big( \sum_{t=1}^{T} F(y_t, x) + A(y_{t-1}, y_t) \big) is the normalization factor. For the common log-linear models, F(y_t, x) can be computed by

F(y_t, x) = w_{y_t}^{\top} f(y_t, x) + b_{y_t},   (2)

where f(y_t, x) ∈ R^D are the word-level features for y_t over the sentence x and D is the number of features. f(y_t, x) can be manually designed or automatically extracted using neural networks, such as window-based neural networks (Collobert et al., 2011).

If we consider segment-level tagging directly 1, we get a segmentation of the previous tag sentence. With a little abuse of notation, we denote a segmentation by s = (s_1, ..., s_{|s|}), in which the jth segment s_j = ⟨h_j, d_j, y_j⟩ consists of a start position h_j, a length d_j ≤ L, where L is a predefined upper bound, and a tag y_j. Conceptually, s_j means that tag y_j is given to the words (x_{h_j}, ..., x_{h_j + d_j − 1}). A Semi-CRF (Sarawagi and Cohen, 2004) defines a conditional distribution

p(s | x) = \frac{1}{Z(x)} \exp\Big( \sum_{j=1}^{|s|} F(s_j, x) + A(y_{j-1}, y_j) \Big),   (3)

Figure 1: An overview of grSemi-CRFs. For simplicity, we set the segment length upper bound L = 4 and the sentence length T = 6. The left side is the feature extractor, in which each node denotes a vector of segment-level features (e.g., z_k^{(d)} for the kth node in the dth layer). Embeddings of word-level input features are used as length-1 segment-level features, and a length-d feature is extracted from two adjacent length-(d − 1) features. The right side is the Semi-CRF. Tag score vectors are computed as linear transformations of segment-level features, and their number equals the number of nodes in the same layer. For clarity, we use a triangle, square, pentagon and hexagon to denote the tag score vectors for length-1, 2, 3 and 4 segments, and directed links to denote the tag transitions of adjacent segments.
where F(s_j, x) is the potential or tag score for segment s_j, A(y_{j−1}, y_j) is the transition score measuring tag dependencies of adjacent segments, and Z(x) = \sum_{s'} \exp\big( \sum_{j=1}^{|s'|} F(s'_j, x) + A(y'_{j−1}, y'_j) \big) is the normalization factor. For the common log-linear models, F(s_j, x) can be computed by

F(s_j, x) = v_{y_j}^{\top} f(s_j, x) + b_{y_j},   (4)

where V = (v_1, ..., v_{|Y|})^{\top} ∈ R^{|Y|×D}, b_V = (b_1, ..., b_{|Y|})^{\top} ∈ R^{|Y|}, and f(s_j, x) ∈ R^D are the segment-level features for s_j over the sentence x. As Eq. (1) and Eq. (3) show, CRFs can be regarded as a special case of Semi-CRFs with L = 1. CRFs need features only for length-1 segments (i.e., words), while Semi-CRFs need features for length-ℓ segments (1 ≤ ℓ ≤ L). Therefore, to model the same sentence, Semi-CRFs generally need many more features than CRFs, especially when L is large. Besides, unlike the word-level features used in CRFs, the sources of segment-level features are often quite limited. In existing work, the sources of f(s_j, x) can be roughly divided into two parts: (1) concatenations of word-level features (Sarawagi and Cohen, 2004; Okanohara et al., 2006); and (2) hand-crafted segment-level features, including task-insensitive features, like the length of segments, and task-specific features, like the verb phrase patterns in opinion extraction (Yang and Cardie, 2012). As manually designing features is time-consuming and often fails to capture the rich statistics underlying the data, how to automatically extract features for Semi-CRFs remains a challenge.
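To make the segment-level modeling concrete, the following is a minimal sketch of Semi-CRF Viterbi decoding under segment scores F and transition scores A as in Eq. (3). The list-of-arrays layout for F (one array per segment length) is an assumption of this sketch, not the paper's implementation:

```python
import numpy as np

def semicrf_viterbi(F, A, L):
    """Viterbi decoding for a Semi-CRF.
    F[d-1][k, y]: tag score of the length-d segment starting at word k with tag y;
    A[y_prev, y]: transition score between adjacent segment tags.
    Returns the best segmentation as (start, length, tag) triples."""
    T, Y = F[0].shape
    # best[t, y]: score of the best segmentation of words 0..t-1 whose last segment has tag y
    best = np.full((T + 1, Y), -np.inf)
    best[0, :] = 0.0
    back = {}
    for t in range(1, T + 1):
        for d in range(1, min(L, t) + 1):
            h = t - d                                   # segment start position
            for y in range(Y):
                prev = best[h] + (A[:, y] if h > 0 else 0.0)
                y_prev = int(np.argmax(prev))
                score = prev[y_prev] + F[d - 1][h, y]
                if score > best[t, y]:
                    best[t, y] = score
                    back[(t, y)] = (h, d, y_prev)
    # trace back the highest-scoring segmentation
    y = int(np.argmax(best[T]))
    t, segs = T, []
    while t > 0:
        h, d, y_prev = back[(t, y)]
        segs.append((h, d, y))
        t, y = h, y_prev
    return segs[::-1]
```

On a toy two-word sentence where the single length-2 segment scores highest, this returns one ⟨h_j, d_j, y_j⟩ triple, mirroring the segmentation notation above.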

Gated Recursive Semi-Markov CRFs
In this section, we present Gated Recursive Semi-CRFs (grSemi-CRFs), which inherit the advantages of Semi-CRFs in segment-level modeling, and also solve the feature extraction problem of Semi-CRFs by introducing a gated recursive convolutional neural network (grConv) as the feature extractor. Instead of building multiple feature extractors at different scales of segment length, grSemi-CRFs extract features for segments of any length with a single grConv, and learn the parameters effectively via sharing statistics.

Architecture
The architecture of grSemi-CRFs is illustrated in Figure 1. A grSemi-CRF can be divided into two parts, a feature extractor (i.e., grConv) and a Semi-CRF. Below, we explain each part in turn.
As is shown, the feature extractor is a pyramid-like directed acyclic graph (DAG), in which nodes are stacked layer by layer and information is propagated from adjacent nodes in the same layer to their co-descendants in the higher layer through directed links. Recall that L denotes the upper bound of the segment length (i.e., the height of the grConv); we regard the bottom level as the 1st level and the top level as the Lth level. Then, for a length-T sentence, the dth level has T − d + 1 nodes, which correspond to features for the T − d + 1 length-d segments. The kth node in the dth layer corresponds to the segment-level latent features z_k^{(d)} ∈ R^D, which encode the meaning of the segment, e.g., its syntactic role (for text chunking) or semantic meaning (for NER).

Figure 2: The building block of the feature extractor (i.e., grConv), in which parameters are shared across the pyramid structure.
Like CRFs, grSemi-CRFs accept word-level categorical inputs (i.e., x_k), which are transformed into continuous vectors (i.e., embeddings) according to look-up tables and then used as length-1 segment-level features (i.e., z_k^{(1)}). To be clear, we call these inputs input features and the extracted segment-level features (i.e., z_k^{(d)}) segment-level latent features. Besides, grSemi-CRFs also allow segment-level input features (e.g., gazetteers) directly, as shown in Eq. (12). We discuss more details in Section 4.3.
The building block of the feature extractor is shown in Figure 2, where an intermediate node ẑ_k^{(d)} is first computed from the two adjacent nodes in the layer below:

ẑ_k^{(d)} = g(α_k^{(d)}),  where  α_k^{(d)} = W_L z_k^{(d−1)} + W_R z_{k+1}^{(d−1)} + b_W,   (5)

where W_L, W_R ∈ R^{D×D} and b_W ∈ R^D are shared globally, and g(·) is a non-linear activation function 2.
Then, the length-d segment-level latent features z_k^{(d)} are computed as a gated combination:

z_k^{(d)} = θ_L z_k^{(d−1)} + θ_R z_{k+1}^{(d−1)} + θ_M ẑ_k^{(d)},   (6)

where θ_L, θ_M and θ_R ∈ R are the gating coefficients, which satisfy θ_L, θ_R, θ_M ≥ 0 and θ_L + θ_R + θ_M = 1. Here, we make a little modification of grConvs by making the gating coefficients vectors instead of scalars, i.e.,

z_k^{(d)} = θ_L ∘ z_k^{(d−1)} + θ_R ∘ z_{k+1}^{(d−1)} + θ_M ∘ ẑ_k^{(d)},   (7)

where ∘ denotes the element-wise product and θ_L, θ_R and θ_M ∈ R^D are vectorial gating coefficients 3, which are non-negative and satisfy θ_L + θ_R + θ_M = 1 element-wise. There are two reasons for this modification: (1) theoretically, the element-wise combination gives a more detailed model, as each feature in z_k^{(d)} may have its own combining weights; and (2) experimentally, this setting makes our grSemi-CRF 4 more flexible, which increases its generalizability and leads to better performance, as shown in Table 4.
We can regard Eq. (7) as a soft gate function to control the propagation flows. Besides, all the parameters (i.e., W L , W R , b W , G L , G R , b G ) are shared globally and recursively applied to the input sentence in a bottom-up manner. All of these account for the name gated recursive convolutional neural networks (grConvs).
Eq. (5) and Eq. (7) build the information propagation criteria in a grConv. The basic assumption behind them is that the meaning of one segment can be represented as a linear combination of three parts: (1) the meaning of its prefix segment, (2) the meaning of its suffix segment and (3) the joint meaning of both (i.e., their complex interaction). This matches our intuition about the hierarchical structure in the composition of a sentence. For example, the meaning of "the United States" depends on the suffix segment "United States", whose meaning comes not only from its prefix "United" or its suffix "States", but from the interaction of both.
The vectorial gating coefficients θ_L, θ_R and θ_M are computed adaptively from the two child nodes, i.e.,

[θ_L; θ_R; θ_M] = \exp\big( G_L z_k^{(d−1)} + G_R z_{k+1}^{(d−1)} + b_G \big) / Z,   (8)

where G_L, G_R ∈ R^{3D×D} and b_G ∈ R^{3D} are shared globally, the division normalizes each dimension across the three gates, and Z ∈ R^D holds the normalization coefficients, whose ith element is computed via

Z_i = \sum_{m=0}^{2} \big[ \exp\big( G_L z_k^{(d−1)} + G_R z_{k+1}^{(d−1)} + b_G \big) \big]_{i + mD}.   (9)

After the forward propagation of the feature extractor is over, the tag scores (i.e., the potential functions for the Semi-CRF) are computed through a linear transformation. For segment s_j = ⟨h_j, d_j, y_j⟩, the corresponding potential/tag score is

F(s_j, x) = \big[ V_0^{(d_j)} z_{h_j}^{(d_j)} + b^{(d_j)} \big]_{y_j},   (10)

where V_0^{(d_j)} ∈ R^{|Y|×D} and b^{(d_j)} ∈ R^{|Y|} are parameters for length-d_j segments. To encode contextual information, we can assume that the tag of a segment depends not only on itself but also on its neighbouring segments of the same length, i.e.,

F(s_j, x) = \Big[ \sum_{h=−H}^{H} V_h^{(d_j)} z_{h_j+h}^{(d_j)} + b^{(d_j)} \Big]_{y_j},   (11)

where each V_h^{(d_j)} ∈ R^{|Y|×D} and H is the window width for neighbouring segments. Apart from the automatically extracted segment-level latent features z_k^{(d)}, grSemi-CRFs also allow segment-level input features f(s_j, x) (e.g., gazetteers), i.e.,

F(s_j, x) = \Big[ \sum_{h=−H}^{H} V_h^{(d_j)} z_{h_j+h}^{(d_j)} + b^{(d_j)} + U^{(d_j)} f(s_j, x) + c^{(d_j)} \Big]_{y_j},   (12)

where U^{(d_j)} ∈ R^{|Y|×D} and c^{(d_j)} ∈ R^{|Y|} are parameters for the segment-level input features. Then, we can use Eq. (3) for inference via a Semi-CRF version of the Viterbi algorithm (Sarawagi and Cohen, 2004).
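As a minimal illustration (not the authors' implementation), one forward pass of the grConv pyramid in Eqs. (5)-(9) can be sketched in numpy as follows; the choice of tanh for g(·) and the random parameter initialization are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, L = 4, 6, 4                      # feature size, sentence length, max segment length

# Globally shared parameters (randomly initialized for this sketch).
W_L, W_R = rng.normal(size=(D, D)), rng.normal(size=(D, D))
b_W = np.zeros(D)
G_L, G_R = rng.normal(size=(3 * D, D)), rng.normal(size=(3 * D, D))
b_G = np.zeros(3 * D)

def grconv_forward(Z1):
    """Forward pass of the pyramid.
    Z1: (T, D) length-1 features; returns a list where layer d-1 holds the
    (T - d + 1, D) matrix of length-d segment-level latent features."""
    layers = [Z1]
    for d in range(2, L + 1):
        left, right = layers[-1][:-1], layers[-1][1:]          # adjacent node pairs
        z_hat = np.tanh(left @ W_L.T + right @ W_R.T + b_W)    # Eq. (5)
        gates = np.exp(left @ G_L.T + right @ G_R.T + b_G)     # unnormalized gates
        gates = gates.reshape(-1, 3, D)
        gates /= gates.sum(axis=1, keepdims=True)              # Eqs. (8)-(9): gates sum to 1 per dimension
        th_L, th_R, th_M = gates[:, 0], gates[:, 1], gates[:, 2]
        layers.append(th_L * left + th_R * right + th_M * z_hat)  # Eq. (7)
    return layers

layers = grconv_forward(rng.normal(size=(T, D)))
# layer d has T - d + 1 nodes: here 6, 5, 4, 3 nodes for d = 1..4
```

Tag score vectors as in Eq. (10) would then be linear maps of each `layers[d-1][k]`, one |Y|-dimensional vector per node.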

Learning of Parameters
To learn grSemi-CRFs, we maximize the log likelihood

L = log p(s | x)   (13)

over all the parameters. Here, for notational simplicity, we consider the simplest case, i.e., using Eq. (10) to compute tag scores. More details can be found in the supplementary note.
Gradients of the Semi-CRF-based parameters (i.e., A and V_0) and of the tag scores F(s_j, x) can be computed from the marginal probabilities of neighbouring segments via a Semi-CRF version of the forward-backward algorithm (Sarawagi and Cohen, 2004). As for the grConv-based parameters (e.g., W_L and G_L), we compute their gradients by back propagation from the gradients of the nodes, which decompose as 5

∂L/∂z_k^{(d)} = \Big(\frac{∂z_k^{(d+1)}}{∂z_k^{(d)}}\Big)^{\top} \frac{∂L}{∂z_k^{(d+1)}} + \Big(\frac{∂z_{k−1}^{(d+1)}}{∂z_k^{(d)}}\Big)^{\top} \frac{∂L}{∂z_{k−1}^{(d+1)}} + \big(V_0^{(d)}\big)^{\top} \frac{∂L}{∂F(s_k^{(d)}, x)},   (14)

where s_k^{(d)} = ⟨k, d, Y⟩ denotes the segments with all possible tags for z_k^{(d)}, so that ∂L/∂F(s_k^{(d)}, x) is a length-|Y| vector, and

\frac{∂z_k^{(d+1)}}{∂z_k^{(d)}} = diag(θ_L) + diag(θ_M)\, diag\big(g'(α_k^{(d+1)})\big)\, W_L + ⋯,   (15)

where diag(θ_L) denotes the diagonal matrix spanned by the vector θ_L, the omitted terms flow through the gating coefficients, and ∂z_{k−1}^{(d+1)}/∂z_k^{(d)} has a similar form with W_R and θ_R. As Eq. (14) shows, for each node in the feature extractor of grSemi-CRFs, the gradient consists of two parts: (1) the gradients back-propagated from higher-layer nodes (i.e., longer segments); and (2) the supervising signals from the Semi-CRF. In other words, the supervision in the objective function reaches every node in grSemi-CRFs. This is a nice property compared to other neural feature extractors used with CRFs, in which only the top few layers receive supervision. Besides, the term diag(θ_L) in Eq. (15) prevents ∂z_k^{(d+1)}/∂z_k^{(d)} from being too small when g'(α_k^{(d+1)}) and W_L are small, acting like the linear recurrent connection in the memory block of an LSTM (Hochreiter and Schmidhuber, 1997; Zhao et al., 2015). All of these help avoid gradient vanishing problems when training grSemi-CRFs.
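The same forward recursion underlies both the normalizer Z(x) in Eq. (3) and the forward-backward marginals used for the gradients. A minimal log-space sketch of the forward pass computing log Z(x) (with the same hypothetical list-of-arrays layout for F as before) could look like this:

```python
import numpy as np

def semicrf_log_partition(F, A, L):
    """Forward algorithm for the Semi-CRF normalizer log Z(x).
    F[d-1][k, y]: score of the length-d segment starting at word k with tag y;
    A[y_prev, y]: transition score between adjacent segment tags."""
    T, Y = F[0].shape
    # alpha[t, y]: log-sum of scores over all segmentations of words 0..t-1 ending with tag y
    alpha = np.full((T + 1, Y), -np.inf)
    for t in range(1, T + 1):
        for d in range(1, min(L, t) + 1):
            h = t - d                                    # segment start position
            if h == 0:
                inc = F[d - 1][0]                        # first segment: no incoming transition
            else:
                prev = alpha[h][:, None] + A             # (Y_prev, Y)
                m = prev.max(axis=0)                     # stable log-sum-exp over the previous tag
                inc = m + np.log(np.exp(prev - m).sum(axis=0)) + F[d - 1][h]
            alpha[t] = np.logaddexp(alpha[t], inc)
    m = alpha[T].max()
    return float(m + np.log(np.exp(alpha[T] - m).sum()))
```

With all scores zero, T = 2, Y = 2 and L = 2, there are four two-segment labelings and two one-segment labelings, so log Z(x) = log 6, which makes a convenient sanity check.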

Experiments
We evaluate grSemi-CRFs on two segment-level sequence tagging NLP tasks: text chunking and named entity recognition (NER).

Datasets
For text chunking, we use the CONLL 2000 text chunking shared task dataset 6 (Tjong Kim Sang and Buchholz, 2000), in which the objective is to divide the whole sentence into segments according to their syntactic roles, such as noun phrases ("NP"), verb phrases ("VP") and adjective phrases ("ADJP"). We call it a "segment-rich" task, as the number of phrases is much higher than that of non-phrase words, which are tagged as others ("O"). We evaluate performance over all the chunks instead of only noun phrase (NP) chunks.
For NER, we use the CONLL 2003 named entity recognition shared task dataset 7 (Tjong Kim Sang and De Meulder, 2003), in which segments are tagged with one of four entity types: person ("PER"), location ("LOC"), organization ("ORG") and miscellaneous ("MISC"), or others ("O"), which denotes non-entities. We call it a "segment-sparse" task, as entities are rare while non-entities are common.

Input Features
For each word, we use multiple input features, including the word itself, its length-3 prefix and length-4 suffix, its capitalization pattern, its POS tag, the length-4, 8, 12 and 20 prefixes of its Brown clusters (Brown et al., 1992), and gazetteers 8. All of them are used as word-level input features except the gazetteers, which are used as segment-level input features directly. All the embeddings for word-level inputs are randomly initialized except the word embeddings, which can be initialized randomly or by pretraining over unlabeled data, which is external information relative to the dataset. Besides word embeddings, Brown clusters and gazetteers are also based on external information, as summarized below:

• Word embeddings. We use Senna embeddings 9 (Collobert et al., 2011), which are 50-dimensional and have been commonly used in sequence tagging tasks (Collobert et al., 2011; Turian et al., 2010);

• Brown clusters. We train two types of Brown clusters using the implementation from Liang (2005): (1) we follow the setups of Ratinov and Roth (2009), Turian et al. (2010) and Collobert et al. (2011) to generate 1000 Brown clusters on the Reuters RCV1 dataset (Lewis et al., 2004); (2) we generate 1000 Brown clusters on the New York Times (NYT) corpus (Sandhaus, 2008);

• Gazetteers. We build our gazetteers based on the gazetteers used in Senna (Collobert et al., 2011) and Wikipedia entries, mainly locations and organizations. We also denoise our gazetteers by removing overlapping entities and using the BBN Pronoun Coreference and Entity Type Corpus (Weischedel and Brunstein, 2005) as a filter 10.
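As an illustration, the simpler word-level features listed above (omitting POS tags, Brown clusters and gazetteers) could be extracted by a hypothetical helper like the following; the feature names and the set of capitalization patterns are assumptions of this sketch:

```python
def word_input_features(word):
    """Sketch of per-word categorical input features; each value would be
    mapped to an embedding via a look-up table (hypothetical helper)."""
    if word.isupper():
        cap = "ALLCAPS"
    elif word[:1].isupper():
        cap = "INITCAP"
    elif any(c.isupper() for c in word):
        cap = "MIXEDCAP"
    else:
        cap = "NOCAP"
    lower = word.lower()
    return {
        "word": lower,
        "prefix3": lower[:3],      # length-3 prefix
        "suffix4": lower[-4:],     # length-4 suffix
        "cap": cap,                # capitalization pattern
    }

feats = word_input_features("Rostock")
# {"word": "rostock", "prefix3": "ros", "suffix4": "tock", "cap": "INITCAP"}
```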

Implementation Details
To learn grSemi-CRFs, we employ Adagrad (Duchi et al., 2011), an adaptive stochastic gradient descent method which has proven successful in similar tasks (Chen et al., 2015; Zhao et al., 2015). To avoid overfitting, we use the dropout strategy (Srivastava et al., 2014) and apply it to the first layer (i.e., z_k^{(1)}). We also use ensembles of classifiers, which have proven an effective way to improve generalization performance (Collobert et al., 2011). All results are obtained by decoding with an averaged Semi-CRF after 10 training runs with randomly initialized parameters.
For the CONLL 2003 dataset, we use the F1 scores on the development set to choose the best-performing model in each run. For the CONLL 2000 dataset, as no development set is provided, we use cross validation as in Turian et al. (2010) to choose hyperparameters. After that, we retrain the model with the chosen hyperparameters and keep the final model of each run.
Our hyperparameter settings for the two tasks are shown in Table 1. The segment length upper bound is set according to the maximum segment length in the training set. We set the minibatch size to 10, which means that we process 10 sentences in a batch. The window width defines the parameter H in Eq. (12) when producing tag score vectors.

Results and Analysis

Table 2 shows the results of our grSemi-CRFs and other models 11. We divide the other models into two categories, i.e., neural models and non-neural models, according to whether neural networks are used as automatic feature extractors. Among the neural models, Senna (Collobert et al., 2011) consists of a window-based neural network for feature extraction and a CRF for word-level modeling, while BI-LSTM-CRF uses a bidirectional Long Short-Term Memory network for feature extraction and a CRF for word-level modeling.
For non-neural models, JESS-CM (Suzuki and Isozaki, 2008) is a semi-supervised model which combines Hidden Markov Models (HMMs) with CRFs and uses 1 billion unlabelled words in training. Lin and Wu (2009) cluster 20 million phrases over a corpus with around 700 billion tokens, and use the resulting clusters as features in CRFs. Passos et al. (2014) propose a novel word embedding method which incorporates gazetteers as supervising signals in pretraining and builds a log-linear CRF over the embeddings. Ratinov and Roth (2009) use CRFs based on many non-local features and 30 gazetteers, extracted from Wikipedia and other websites, with more than 1.5 million entities.
As Table 2 shows, grSemi-CRFs outperform the other neural models on both text chunking and named entity recognition (NER). BI-LSTM-CRFs use many more input features than ours, which explains why the performance of our grSemi-CRFs is merely comparable (i.e., 93.92% versus 94.13% and 84.66% versus 84.26%) without external information. However, once Senna embeddings are used, our grSemi-CRFs perform much better than BI-LSTM-CRFs.
The non-neural models share the use of many hand-crafted features, often task-specific ones. Unlike them, grSemi-CRFs use far fewer input features, most of which are task-insensitive 13. Nevertheless, grSemi-CRFs achieve almost the same performance, and sometimes better. For text chunking, grSemi-CRF outperforms all reported supervised models except JESS-CM (Suzuki and Isozaki, 2008), a semi-supervised model using giga-word scale unlabeled data in training 14; even so, the performance of our grSemi-CRF (95.01%) is very close to that of JESS-CM (95.15%). For NER, the performance of grSemi-CRFs is also very close to the state-of-the-art result (90.87% versus 90.90%).

Impact of External Information
As Table 3 shows, external information improves the performance of grSemi-CRFs on both tasks.
Compared to text chunking, external information plays an extremely important role in NER, which coincides with the general idea that NER is a knowledge-intensive task (Ratinov and Roth, 2009).

Impact of Vectorial Gating Coefficients
As Table 4 shows, a grSemi-CRF using vectorial gating coefficients (i.e., Eq. (7)) performs better than one using scalar gating coefficients (i.e., Eq. (6)). This provides evidence for the theoretical intuition that vectorial gating coefficients model the combinations of segment-level latent features in more detail and thus perform better than scalar gating coefficients.

Visualization of Learnt Segment-Level Features
To demonstrate the quality of the learnt segment-level features, we use an indirect method widely adopted in previous work, e.g., Collobert et al. (2011). More specifically, we show the 10 nearest neighbours of some selected query segments according to the Euclidean metric on the corresponding features 15. To fully demonstrate the power of grSemi-CRF in learning segment-level features, we use the Emb+Brown(RCV1) model in Table 3, which uses no gazetteers. We train the model on the CONLL 2003 training set and find nearest neighbours in the CONLL 2003 test set. We place no restrictions on the segments, i.e., all possible segments of different lengths in the CONLL 2003 test set are candidates. As Table 5 shows, most of the nearest segments are meaningful and semantically related. For example, the nearest segments for "Filippo Inzaghi" are not only tagged as persons, but are also names of famous football players, like "Filippo Inzaghi" himself.
There also exist some imperfect results. E.g., for "Central African Republic", the nearest segments, which contain the queried segment itself, are semantically related but not syntactically similar. The major reason may be that the CONLL 2003 dataset is a small corpus (compared to the vast unlabelled data used to train Senna embeddings), which restricts both the range of candidate segments and the quality of the learnt segment-level features. Another reason is that the labels in the CONLL 2003 dataset mainly encode semantic information (e.g., named entities) instead of syntactic information (e.g., chunks).
Besides, as we place no restriction on the form of candidate segments, sometimes only a part of a whole phrase is retrieved, e.g., "FC Hansa", which is the prefix of "FC Hansa Rostock". Exploring better ways of utilizing unlabelled data to improve the learning of segment-level features is part of the future work.

Related Work

Cho et al. (2014) first propose grConvs to learn fixed-length representations of the whole source sentence in neural machine translation. Zhao et al. (2015) use grConvs to learn hierarchical representations (i.e., multiple fixed-length representations) of the whole sentence for sentence-level classification. Both of them focus on sentence-level classification problems, while grSemi-CRFs solve segment-level classification (sequence tagging) problems, which are more fine-grained. Chen et al. (2015) propose Gated Recursive Neural Networks (GRNNs), a variant of grConvs, to solve the Chinese word segmentation problem. GRNNs still do word-level modeling by using CRFs, while grSemi-CRFs do segment-level modeling directly by using Semi-CRFs and make full use of the recursive structure of grConvs. We believe that the recursive neural network (e.g., the grConv) is a natural feature extractor for Semi-CRFs, as it extracts features for every possible segment in one forward propagation over a single trained model, which is fast and efficient. In this sense, grSemi-CRFs provide a promising direction to explore.

Conclusions
In this paper, we propose Gated Recursive Semi-Markov Conditional Random Fields (grSemi-CRFs) for segment-level sequence tagging tasks. Unlike word-level models such as CRFs, grSemi-CRFs model segments directly, without the need for extra tagging schemes, and readily utilize segment-level features, both hand-crafted and automatically extracted by a grConv. Experimental evaluations demonstrate the effectiveness of grSemi-CRFs on both text chunking and NER tasks.
In future work, we are interested in exploring better ways of utilizing vast unlabelled data to improve grSemi-CRFs, e.g., learning phrase embeddings from unlabelled data or designing a semi-supervised version of grSemi-CRFs.