GCDT: A Global Context Enhanced Deep Transition Architecture for Sequence Labeling

Current state-of-the-art systems for sequence labeling are typically based on the family of Recurrent Neural Networks (RNNs). However, the shallow connections between consecutive hidden states of RNNs and insufficient modeling of global information restrict the potential performance of those models. In this paper, we try to address these issues, and thus propose a Global Context enhanced Deep Transition architecture for sequence labeling named GCDT. We deepen the state transition path at each position in a sentence, and further assign every token with a global representation learned from the entire sentence. Experiments on two standard sequence labeling tasks show that, given only training data and the ubiquitous word embeddings (Glove), our GCDT achieves 91.96 F1 on the CoNLL03 NER task and 95.43 F1 on the CoNLL2000 Chunking task, which outperforms the best reported results under the same settings. Furthermore, by leveraging BERT as an additional resource, we establish new state-of-the-art results with 93.47 F1 on NER and 97.30 F1 on Chunking.


Introduction
Sequence labeling tasks, including part-of-speech tagging (POS), syntactic chunking and named entity recognition (NER), are fundamental and challenging problems of Natural Language Processing (NLP). Recently, neural models have become the de-facto standard for high-performance systems. Among various neural networks for sequence labeling, bi-directional RNNs (BiRNNs), especially BiLSTMs (Hochreiter and Schmidhuber, 1997), have become a dominant method on multiple benchmark datasets (Chiu and Nichols, 2016; Lample et al., 2016; Peters et al., 2017).
However, the BiLSTM architecture has several natural limitations. For example, at each time step, a BiLSTM consumes an incoming word and constructs a new summary of the past subsequence. This procedure should be highly nonlinear, to allow the hidden states to rapidly adapt to the mutable input while still preserving a useful summary of the past (Pascanu et al., 2014). In BiLSTMs, however, even stacked ones, the transition depth between consecutive hidden states is inherently shallow. Moreover, global contextual information, which has been shown to be highly useful for modeling sequences (Zhang et al., 2018), is insufficiently captured at each token position in BiLSTMs. Consequently, inadequate representations flow into the final prediction layer, which restricts the performance of BiLSTMs.
In this paper, we present a global context enhanced deep transition architecture to address these limitations of BiLSTMs. In particular, we base our network on the deep transition (DT) RNN (Pascanu et al., 2014), which increases the transition depth between consecutive hidden states for richer representations. Furthermore, we assign each token an additional representation, namely the global contextual embedding, which is a summation of the hidden states of a specific DT over the whole input sentence. This helps the model make more accurate predictions, since the combinatorial computing between diverse token embeddings and the global contextual embedding can capture useful representations in a way that improves the overall system performance.
We evaluate our GCDT on both CoNLL03 and CoNLL2000. Extensive experiments on the two benchmarks suggest that, given only training data and publicly available word embeddings (Glove), our GCDT surpasses previous state-of-the-art systems on both tasks. Furthermore, by exploiting BERT as an extra resource, we report new state-of-the-art F1 scores of 93.47 on CoNLL03 and 97.30 on CoNLL2000. The main contributions of this paper can be summarized as follows:
• We are the first to introduce the deep transition architecture for sequence labeling, and we further enhance it with a global contextual representation at the sentence level; we name the resulting model GCDT.
• GCDT substantially outperforms previous systems on two major tasks of NER and Chunking. Moreover, by leveraging BERT as an extra resource to enhance GCDT, we report new state-of-the-art results on both tasks.
• We conduct elaborate investigations of global contextual representation, model complexity and effects of various components in GCDT.

Background
Given a sequence $X = \{x_1, x_2, \cdots, x_N\}$ of $N$ tokens and its corresponding linguistic labels $Y = \{y_1, y_2, \cdots, y_N\}$ of equal length, sequence labeling tasks aim to learn a parameterized mapping function $f_\theta: X \rightarrow Y$ from input tokens to task-specific labels. Typically, the input sentence is first encoded into a sequence of distributed representations $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$ by character-aware and pre-trained word embeddings. The majority of high-performance models use bidirectional RNNs, BiLSTMs in particular, to encode the token embeddings $\mathbf{X}$ into context-sensitive representations for the final prediction.
Additionally, it is beneficial to model and predict labels jointly, so a conditional random field (CRF; Lafferty et al., 2001) is commonly used as the decoder layer. At the training stage, these models maximize the log probability of the correct tag sequence:
\[ \log p(y \mid x) = \log \frac{\exp(s(x, y))}{\sum_{\tilde{y} \in Y_x} \exp(s(x, \tilde{y}))} \]
where $s(\cdot)$ is the score function and $Y_x$ is the set of all possible tag sequences. Typically, the Viterbi algorithm (Forney, 1973) is used to search for the label sequence with the maximum score when decoding:
\[ y^* = \arg\max_{\tilde{y} \in Y_x} s(x, \tilde{y}) \]

GCDT

Overview
In this section, we start with a brief overview of GCDT and then describe each submodule in more detail in the following sections. As shown in Figure 1, there are three deep transition modules in our model, namely the global contextual encoder, the sequence labeling encoder and the sequence labeling decoder.
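Token Representation Given a sentence $X = \{x_1, x_2, \cdots, x_N\}$ with $N$ tokens, our model first captures each token representation $\mathbf{x}_t$ by concatenating three primary embeddings:
\[ \mathbf{x}_t = [\mathbf{c}_t; \mathbf{w}_t; \mathbf{g}] \qquad (3) \]
1. Character-aware word embedding $\mathbf{c}_t$ is generated by a CNN over the characters of the token (see Local Word Representation).
2. Pre-trained word embedding $\mathbf{w}_t$ is obtained from pre-trained word embeddings (see Local Word Representation).
3. Global contextual embedding $\mathbf{g}$ is extracted from a bidirectional DT, and more details are given in the following paragraphs.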
The global embedding $\mathbf{g}$ is computed by mean pooling over all hidden states $\{\mathbf{h}^g_1, \mathbf{h}^g_2, \cdots, \mathbf{h}^g_N\}$ of the global contextual encoder (right part of Figure 1). For simplicity, "DT" can be regarded as a reinforced Gated Recurrent Unit (GRU; Chung et al., 2014); more details about DT are given in the next section. Thus $\mathbf{g}$ is computed as follows:
\[ \mathbf{g} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{h}^g_n \]
Sequence Labeling Encoder Subsequently, the concatenated token embeddings $\mathbf{x}_t$ (Eq. 3) are fed into the sequence labeling encoder (bottom left part of Figure 1).
Sequence Labeling Decoder Considering the $t$-th word in the sentence, the output of the sequence labeling encoder $\mathbf{h}_t$ along with the past label embedding $\mathbf{y}_{t-1}$ are fed into the decoder (top left part of Figure 1). Subsequently, the output of the decoder $\mathbf{s}_t$ is transformed into $\mathbf{l}_t$ for the final softmax over the tag vocabulary. Formally, the label of word $x_t$ is predicted by the following probabilistic equation:
\[ P(y_t = j \mid X) = \mathrm{softmax}(\mathbf{l}_t)[j] \qquad (13) \]
As we can see from the above procedure and Figure 1, GCDT first encodes the global contextual representation along the sequential axis with DT, and uses it to enrich the token representations. At each time step, we encode the past label information jointly using the sequence labeling decoder instead of resorting to a CRF. Additionally, we employ the beam search algorithm to infer the most probable sequence of labels at test time.
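Because the decoder conditions each prediction on the previous label, inference has to search over label sequences. The following is a minimal sketch of beam search over such a decoder, not the authors' implementation. The function `toy_step_log_probs`, the three-label tag set, the `<s>` start symbol and all probabilities are hypothetical stand-ins for the real decoder and softmax.

```python
import math


def beam_search(num_steps, labels, step_log_probs, beam_size=2, start_label="<s>"):
    """Return the best (label sequence, log score) found by beam search.

    step_log_probs(t, prev_label) must return {label: log probability} for
    position t given the previously predicted label.
    """
    beams = [([], 0.0)]
    for t in range(num_steps):
        candidates = []
        for seq, score in beams:
            prev = seq[-1] if seq else start_label
            log_probs = step_log_probs(t, prev)
            for label in labels:
                candidates.append((seq + [label], score + log_probs[label]))
        # Keep only the highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


def toy_step_log_probs(t, prev_label):
    """Hypothetical scorer: fixed per-position scores plus one label constraint."""
    emissions = [
        {"B": 0.6, "I": 0.3, "O": 0.1},
        {"B": 0.1, "I": 0.6, "O": 0.3},
        {"B": 0.1, "I": 0.2, "O": 0.7},
    ]
    probs = dict(emissions[t])
    if prev_label in ("<s>", "O"):
        probs["I"] = 1e-6  # "I" should not follow "O" or the sentence start
    total = sum(probs.values())
    return {label: math.log(p / total) for label, p in probs.items()}


best_labels, best_score = beam_search(3, ["B", "I", "O"], toy_step_log_probs)
```

With a beam size of 1 this reduces to greedy decoding; larger beams trade extra computation for a wider search.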

Deep Transition RNN
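Deep transition RNNs extend conventional RNNs by increasing the transition depth of consecutive hidden states. Previous studies have shown the superiority of this architecture on both language modeling (Pascanu et al., 2014) and machine translation (Barone et al., 2017; Meng and Zhang, 2019). In particular, Meng and Zhang (2019) propose to maintain a linear transformation path throughout the deep transition procedure with a linear gate to enhance the transition structure.
Following Meng and Zhang (2019), the deep transition block in our hierarchical model is composed of two key components, namely the Linear Transformation enhanced GRU (L-GRU) and the Transition GRU (T-GRU). At each time step, the L-GRU first encodes each token with an additional linear transformation of the input embedding; the hidden state of the L-GRU is then passed into a chain of T-GRUs connected merely by hidden states. Afterwards, the output "state" of the last T-GRU for the current time step is carried over as the "state" input of the first L-GRU for the next time step. Formally, in a unidirectional network with transition number $L$, the hidden state of the $t$-th token in a sentence is computed as:
\[ \mathbf{h}^0_t = \text{L-GRU}(\mathbf{x}_t, \mathbf{h}^L_{t-1}) \]
\[ \mathbf{h}^l_t = \text{T-GRU}^l(\mathbf{h}^{l-1}_t), \quad 1 \le l \le L \]
Linear Transformation Enhanced GRU L-GRU extends the conventional GRU by an additional linear transformation of the input token embeddings. At time step $t$, the hidden state of L-GRU is computed as follows:
\[ \mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t \]
\[ \tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_{xh} \mathbf{x}_t + \mathbf{r}_t \odot (\mathbf{W}_{hh} \mathbf{h}_{t-1})) + \mathbf{l}_t \odot \mathbf{W}_x \mathbf{x}_t \qquad (17) \]
where $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$ are parameter matrices, and the reset gate $\mathbf{r}_t$ and update gate $\mathbf{z}_t$ are the same as in GRU:
\[ \mathbf{r}_t = \sigma(\mathbf{W}_{xr} \mathbf{x}_t + \mathbf{W}_{hr} \mathbf{h}_{t-1}) \]
\[ \mathbf{z}_t = \sigma(\mathbf{W}_{xz} \mathbf{x}_t + \mathbf{W}_{hz} \mathbf{h}_{t-1}) \]
The linear transformation $\mathbf{W}_x \mathbf{x}_t$ in the candidate hidden state $\tilde{\mathbf{h}}_t$ (Eq. 17) is regulated by the linear gate $\mathbf{l}_t$, which is computed as follows:
\[ \mathbf{l}_t = \sigma(\mathbf{W}_{xl} \mathbf{x}_t + \mathbf{W}_{hl} \mathbf{h}_{t-1}) \]
Transition GRU T-GRU is a special case of the conventional GRU, which only takes hidden states from the adjacent lower layer as inputs. At time step $t$ and transition depth $l$, the hidden state of T-GRU is computed as follows:
\[ \mathbf{h}^l_t = (1 - \mathbf{z}^l_t) \odot \mathbf{h}^{l-1}_t + \mathbf{z}^l_t \odot \tilde{\mathbf{h}}^l_t \]
\[ \tilde{\mathbf{h}}^l_t = \tanh(\mathbf{r}^l_t \odot (\mathbf{W}^l_h \mathbf{h}^{l-1}_t)) \]
The reset gate $\mathbf{r}^l_t$ and update gate $\mathbf{z}^l_t$ also only take hidden states as input, and are computed as:
\[ \mathbf{r}^l_t = \sigma(\mathbf{W}^l_r \mathbf{h}^{l-1}_t) \]
\[ \mathbf{z}^l_t = \sigma(\mathbf{W}^l_z \mathbf{h}^{l-1}_t) \]
As indicated above, at each time step of our deep transition block, there is an L-GRU at the bottom and several T-GRUs on top of it.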
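To make the data flow of a deep transition block concrete, here is a minimal pure-Python sketch of a unidirectional block: one L-GRU followed by a chain of T-GRUs at every time step, with the last T-GRU state carried to the next step. It assumes standard GRU gating, omits biases and dropout, and uses tiny random weights; all function and parameter names (`init_dt`, `dt_forward`, `W_xr`, and so on) are illustrative choices rather than the authors' code.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gate(W_x, W_h, x, h):
    """Sigmoid gate over an input projection plus a recurrent projection."""
    return [1.0 / (1.0 + math.exp(-(a + b))) for a, b in zip(matvec(W_x, x), matvec(W_h, h))]


def l_gru(p, x, h_prev):
    """L-GRU: GRU whose candidate state has an extra gated linear path W_x x."""
    r = gate(p["W_xr"], p["W_hr"], x, h_prev)
    z = gate(p["W_xz"], p["W_hz"], x, h_prev)
    l = gate(p["W_xl"], p["W_hl"], x, h_prev)
    xh, hh, lin = matvec(p["W_xh"], x), matvec(p["W_hh"], h_prev), matvec(p["W_x"], x)
    cand = [math.tanh(a + ri * b) + li * c for a, ri, b, li, c in zip(xh, r, hh, l, lin)]
    return [(1 - zi) * hp + zi * c for zi, hp, c in zip(z, h_prev, cand)]


def t_gru(p, h_below):
    """T-GRU: gates and candidate depend only on the state from the layer below."""
    r = [1.0 / (1.0 + math.exp(-a)) for a in matvec(p["W_r"], h_below)]
    z = [1.0 / (1.0 + math.exp(-a)) for a in matvec(p["W_z"], h_below)]
    cand = [math.tanh(ri * a) for ri, a in zip(r, matvec(p["W_h"], h_below))]
    return [(1 - zi) * hb + zi * c for zi, hb, c in zip(z, h_below, cand)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_dt(input_dim, hidden_dim, num_t_gru, seed=0):
    """Create toy parameters for one L-GRU and num_t_gru T-GRUs."""
    rng = random.Random(seed)
    l_params = {n: rand_matrix(hidden_dim, input_dim, rng)
                for n in ("W_xr", "W_xz", "W_xl", "W_xh", "W_x")}
    l_params.update({n: rand_matrix(hidden_dim, hidden_dim, rng)
                     for n in ("W_hr", "W_hz", "W_hl", "W_hh")})
    t_params = [{n: rand_matrix(hidden_dim, hidden_dim, rng) for n in ("W_r", "W_z", "W_h")}
                for _ in range(num_t_gru)]
    return l_params, t_params


def dt_forward(params, xs):
    """Run the block over a sentence; return the top state at every position."""
    l_params, t_params = params
    h = [0.0] * len(l_params["W_hh"])
    outputs = []
    for x in xs:
        h = l_gru(l_params, x, h)      # bottom of the transition path
        for p in t_params:             # deepen the transition at this position
            h = t_gru(p, h)
        outputs.append(h)              # last T-GRU state feeds the next step
    return outputs


params = init_dt(input_dim=4, hidden_dim=3, num_t_gru=2)
sentence = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.1, -0.4, 0.2]]
states = dt_forward(params, sentence)
```

A bidirectional variant would run a second block over the reversed sentence and concatenate the two states at each position.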

Local Word Representation
Character-aware word embeddings It has been demonstrated that character-level information (such as capitalization, prefixes and suffixes) is crucial for sequence labeling tasks (Collobert et al., 2011; dos Santos and Zadrozny, 2014). In our GCDT, the character set consists of all unique characters in the datasets, plus the special symbols "PAD" and "UNK". We use one layer of CNN followed by max pooling to generate character-aware word embeddings.
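As an illustration of this step, the sketch below builds a character vocabulary with "PAD" and "UNK" entries and applies a single convolution layer followed by max pooling over character positions. The dimensions, the random weights and the helper names (`build_char_vocab`, `char_cnn_embedding`) are assumptions made for this example; bias terms and nonlinearities are omitted.

```python
import random

PAD, UNK = 0, 1  # reserved ids for the special symbols


def build_char_vocab(words):
    """Map every unique character to an id, leaving room for PAD and UNK."""
    chars = sorted({ch for word in words for ch in word})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def char_cnn_embedding(word, char_to_id, char_emb, filters, width=3):
    """One convolution layer over characters, then max pooling over positions."""
    if not word:
        raise ValueError("word must contain at least one character")
    ids = [char_to_id.get(ch, UNK) for ch in word]
    half = width // 2
    ids = [PAD] * half + ids + [PAD] * half
    vecs = [char_emb[i] for i in ids]
    # Each window is the concatenation of `width` consecutive character vectors.
    windows = [[v for vec in vecs[i:i + width] for v in vec]
               for i in range(len(vecs) - width + 1)]
    # One output feature per filter: the maximum response over all windows.
    return [max(sum(w * x for w, x in zip(f, win)) for win in windows) for f in filters]


rng = random.Random(0)
char_dim, num_filters, width = 4, 8, 3  # toy sizes chosen for the example
vocab = build_char_vocab(["Paris", "paris"])
char_emb = [[rng.uniform(-1, 1) for _ in range(char_dim)] for _ in range(len(vocab) + 2)]
filters = [[rng.uniform(-1, 1) for _ in range(width * char_dim)] for _ in range(num_filters)]
embedding = char_cnn_embedding("Paris", vocab, char_emb, filters, width)
```

Because "P" and "p" receive different character ids, capitalization can influence the resulting vector, which is the kind of signal the paragraph above refers to.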
Pre-trained word embeddings Pre-trained word embeddings have become a standard component of neural network architectures for various NLP tasks. Since the capitalization of words is crucial for sequence labeling tasks (Collobert et al., 2011), we adopt word embeddings trained in a case-sensitive scheme.
Both the character-aware and pre-trained word embeddings are context-insensitive; we call them local word representations, in contrast to the global contextual embedding described in the next section.

Global Contextual Embedding
We adopt an independent deep transition RNN named global contextual encoder (right part in Figure 1) to capture global features. In particular, we transform the hidden states of global contextual encoder into a fixed-size vector with various strategies, such as mean pooling, max pooling and self-attention mechanism (Vaswani et al., 2017). According to the preliminary experiments, we choose mean pooling strategy considering the balance between effect and efficiency.
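The pooling strategies and the way the pooled vector is attached to each token can be sketched as follows. This is a toy illustration on plain Python lists; the function names and the tiny vectors are hypothetical, and the self-attention variant is not shown.

```python
def mean_pool(states):
    """Average the hidden states over all positions, dimension by dimension."""
    return [sum(column) / len(states) for column in zip(*states)]


def max_pool(states):
    """Take the maximum over all positions, dimension by dimension."""
    return [max(column) for column in zip(*states)]


def enrich_tokens(local_embeddings, global_states, pool=mean_pool):
    """Concatenate the same sentence-level vector g to every local embedding."""
    g = pool(global_states)
    return [local + g for local in local_embeddings]


# Toy hidden states of the global contextual encoder for a 2-token sentence.
global_states = [[1.0, 2.0], [3.0, 6.0]]
# Toy local word representations (character-aware plus pre-trained parts).
local_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
enriched = enrich_tokens(local_embeddings, global_states)
```

Every token receives the same global vector, so the encoder can combine position-specific local features with sentence-level context at every position.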
In conventional BiRNNs, the global contextual feature is insufficiently modeled at each position, as the nature of the recurrent architecture makes an RNN partial to the most recent input tokens. In contrast, our context-aware representation is combined with the local word embeddings directly, which assists in capturing useful representations through combinatorial computing between diverse local word embeddings and the global contextual embedding. We further investigate the effect of the position at which the global embedding is used in the section "Where to Use the Global Representation?".
Metric We adopt the BIOES tagging scheme for both tasks instead of the standard BIO2, since previous studies have highlighted meaningful improvements with this scheme (Ratinov and Roth, 2009). We take the official conlleval script as the token-level F1 metric. Since the data size is relatively small, we train each final model 5 times with different parameter initializations and report the mean and standard deviation of the F1 values.
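For readers unfamiliar with the two schemes, the conversion from BIO2 to BIOES tags can be sketched as below. This is a generic illustration that assumes well-formed BIO2 input; the function name is hypothetical and this is not the preprocessing code used for the experiments.

```python
def bio2_to_bioes(tags):
    """Convert a BIO2 tag sequence to BIOES.

    A chunk of length one becomes S-X; the last token of a longer chunk
    becomes E-X. Other tags are unchanged.
    """
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, chunk_type = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + chunk_type
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + chunk_type)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + chunk_type)
    return converted


example = bio2_to_bioes(["B-PER", "I-PER", "O", "B-LOC"])
```

Here `example` is `["B-PER", "E-PER", "O", "S-LOC"]`: the two-token person name gets an explicit end tag and the single-token location gets an `S-` tag.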

Implementation Details
All trainable parameters in our model are initialized by the method described by Glorot and Bengio (2010). We apply dropout (Srivastava et al., 2014) to embeddings and hidden states with rates of 0.5 and 0.3 respectively. All models are optimized by the Adam optimizer (Kingma and Ba, 2014) with gradient clipping of 5 (Pascanu et al., 2013). The initial learning rate α is set to 0.008 and decreases as the number of training steps grows. We monitor the training process on the development set and report the final result on the test set. A one-layer CNN with a filter size of 3 is used to generate 128-dimensional character-aware word embeddings by max pooling. The cased, 300d Glove embeddings are adopted to initialize the word embeddings, which are frozen in all models. In the auxiliary experiments, the output hidden states of BERT are taken as additional word embeddings and kept fixed throughout training.

Main Results
The main results of our GCDT on CoNLL03 and CoNLL2000 are shown in Table 1 and Table 2 respectively. Given only standard training data and publicly available word embeddings, our GCDT achieves state-of-the-art results on both tasks. It should be noted that some results on NER are not directly comparable to ours, as their final models are trained on both training and development data. More notably, our GCDT surpasses the models that exploit additional task-specific resources or annotated corpora (Luo et al., 2015; Yang et al., 2017b; Chiu and Nichols, 2016). Additionally, we conduct experiments by leveraging the well-known BERT as an external resource for a relatively fair comparison with models that leverage language models (Table 3).

Models                     F1
(Rei, 2017)                86.26
(Liu et al., 2017)         91.71 ± 0.10
(Peters et al., 2017) †    91.93 ± 0.19
(Peters et al., 2018)      92.20
92.61 (2018)

Table 3: F1 scores on the CoNLL03 NER task by leveraging language models. † refers to models trained on both training and development data. We establish a new state-of-the-art result on this task.

Analysis
We choose the CoNLL03 NER task as an example to elucidate the properties of our GCDT and conduct several additional experiments.

Where to Use the Global Representation?
In this experiment, we investigate the effect of the location at which the global contextual embedding is used in our hierarchical model. In particular, we use the global embedding $\mathbf{g}$ to augment:
• the input of the final softmax layer: $\mathbf{x}^{softmax}_k = [\mathbf{h}^{decoder}_k; \mathbf{g}]$
• the input of the sequence labeling decoder: $\mathbf{x}^{decoder}_k = [\mathbf{h}^{encoder}_k; \mathbf{y}_{k-1}; \mathbf{g}]$
• the input of the sequence labeling encoder: $\mathbf{x}^{encoder}_k = [\mathbf{w}_k; \mathbf{c}_k; \mathbf{g}]$
Table 5 shows that the global embedding $\mathbf{g}$ improves performance when used at the relatively low layer (row 3), while it may harm performance when used at the higher layers (row 0 vs. row 1 & 2). In the last option, $\mathbf{g}$ is incorporated to enhance the input token representation of the sequence labeling encoder. The combinatorial computing between the multi-granular local word embeddings ($\mathbf{w}_k$ and $\mathbf{c}_k$) and the global embedding $\mathbf{g}$ can capture more specific and richer representations for the prediction of each token, and thus improves overall system performance. The other two options (row 1, 2), in contrast, concatenate the highly abstract $\mathbf{g}$ with hidden states ($\mathbf{h}^{encoder}_k$ or $\mathbf{h}^{decoder}_k$) from the higher layers, which may bring noise to the token representation due to the similar feature spaces and thus hurt task-specific predictions.

Comparing with Stacked RNNs
Although our proposed GCDT bears some resemblance to the conventional stacked RNNs, they are very different from each other. Firstly, although the stacked RNNs can process very deep architectures, the transition depth between consecutive hidden states in the token level is still shallow.
Secondly, in stacked RNNs, the hidden states along the sequential axis are simply fed into the corresponding positions of the higher layers; that is, only position-aware features are transmitted in the deep architecture. In GCDT, by contrast, the internal states at all token positions of the global contextual encoder are transformed into a fixed-size vector. This context-aware representation provides more general and informative features of the entire sentence compared with stacked RNNs.
To obtain rigorous comparisons, we stack two layers of deep transition RNNs instead of conventional RNNs, with a number of parameters similar to that of GCDT. According to the results in Table 6, the stacked-DT improves the performance of the original DT slightly, while there is still a large margin between GCDT and the stacked-DT. As we can see, our GCDT achieves a much better performance than stacked-DT with a smaller parameter size, which further verifies that our GCDT can effectively leverage global information to learn more useful representations for sequence labeling tasks.

Ablation Experiments
We conduct ablation experiments to investigate the impact of the various components in GCDT. More specifically, each time we remove one kind of token embedding (char-aware, pre-trained or global) from the input of the sequence labeling encoder, and use either DT or a conventional GRU with a similar model size. Results of the different combinations are presented in Table 7. Given the same input embeddings, DT surpasses the conventional GRU substantially in most cases, which further demonstrates the superiority of DT in sequence labeling tasks. Our observations on character-level and pre-trained word embeddings suggest that they have a significant impact on highly competitive results (row 1 & 3 vs. row 5), which is consistent with previous work (dos Santos and Zadrozny, 2014; Lample et al., 2016). Furthermore, the global contextual embedding substantially improves the performance of both DT and GRU based models (row 6 & 7 vs. row 4 & 5).

Effect of BERT
WordPiece is adopted to tokenize sequences in BERT, which may cut a word into pieces, such as converting "Johanson" into "Johan ##son". Therefore, additional effort is needed to maintain the alignment between input tokens and their corresponding labels. We try three strategies to obtain a single BERT embedding for each token in a sequence. Firstly, we take the first sub-word embedding as the whole word embedding after tokenization, which is the strategy employed in the original BERT paper (Devlin et al., 2018). Mean and max pooling are used as the other two strategies. Results of various combinations of BERT type, layer and pooling strategy are shown in Table 8.
It is reasonable that the large BERT model surpasses the smaller one in most cases, due to its larger model capacity and richer contextual representations. For the pooling strategy, "mean" is considered to capture more comprehensive representations of rare words than "first" and "max", and thus yields better average performance. Additionally, we hypothesize that the higher layers in BERT encode more abstract and semantic features, while the lower ones capture more general and syntactic information, which is more helpful for our NER and Chunking tasks. These hypotheses are consistent with the results in Table 8.
Table 8: Comparison of CoNLL03 F1 scores when various types, layers and pooling strategies of BERT are employed. "first" indicates the first sub-word embedding; "mean" and "max" refer to mean and max pooling respectively.
Table 9: F1 scores on CoNLL03 and parameter sizes of various models, where "GRU-384" indicates the conventional GRU with a hidden size of 384, while "DT2-128" refers to the deep transition RNN with a transition number of 2 and a hidden size of 128, and similarly for "DT4-256".
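The three alignment strategies described in this section can be sketched as follows. The example assumes that continuation pieces carry the "##" prefix, as in the "Johan ##son" example, and that special tokens have already been removed; the function name and the toy vectors are hypothetical.

```python
def align_subwords(subword_tokens, subword_vectors, strategy="first"):
    """Collapse sub-word vectors into one vector per original word."""
    groups = []
    for token, vector in zip(subword_tokens, subword_vectors):
        if token.startswith("##") and groups:
            groups[-1].append(vector)   # continuation of the current word
        else:
            groups.append([vector])     # first piece of a new word
    if strategy == "first":
        return [group[0] for group in groups]
    if strategy == "mean":
        return [[sum(col) / len(group) for col in zip(*group)] for group in groups]
    if strategy == "max":
        return [[max(col) for col in zip(*group)] for group in groups]
    raise ValueError("unknown strategy: " + strategy)


tokens = ["Johan", "##son", "plays"]
vectors = [[1.0, 4.0], [3.0, 2.0], [5.0, 6.0]]
first = align_subwords(tokens, vectors, "first")
mean = align_subwords(tokens, vectors, "mean")
maximum = align_subwords(tokens, vectors, "max")
```

All three strategies return exactly one vector per input word, so the result lines up with the label sequence regardless of how many pieces each word was split into.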

Model Complexity
One way of measuring the complexity of a neural model is through the total number of trainable parameters. In GCDT, the global contextual encoder increases the number of parameters of the sequence labeling encoder, because the input dimension is enlarged; we therefore run additional experiments to verify whether the increase in parameters has a large effect on performance. Empirically, we replace DT with a conventional GRU in the global contextual encoder and in the sequence labeling module (both encoder and decoder) respectively. Results of various combinations are shown in Table 9.
Observations on parameter numbers show that DT outperforms GRU substantially, with a smaller size (row 4 & 5 vs. row 0). From the perspective of the global contextual encoder, DT gives slightly better results than GRU (row 3 vs. row 0). We observe similar results in the sequence labeling module (row 1 & 2 vs. row 0). Intuitively, using DT in both modules should further improve performance, which is consistent with the observations in Table 9 (row 4 & 5 vs. row 0).
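To give a feel for how such parameter counts arise, the following sketch counts weight matrices for a conventional GRU and for a deep transition block. It assumes the usual gate structure (three input/recurrent matrix pairs for a GRU, four pairs plus one extra input matrix for an L-GRU, three recurrent matrices per T-GRU) and ignores biases. The input dimension of 428 is a hypothetical value, and the numbers produced are not the ones reported in Table 9.

```python
def gru_params(input_dim, hidden_dim):
    """Reset gate, update gate and candidate: one input and one recurrent matrix each."""
    return 3 * (hidden_dim * input_dim + hidden_dim * hidden_dim)


def l_gru_params(input_dim, hidden_dim):
    """GRU matrices, plus the linear gate and the extra linear input path."""
    pairs = 4 * (hidden_dim * input_dim + hidden_dim * hidden_dim)
    return pairs + hidden_dim * input_dim


def t_gru_params(hidden_dim):
    """Reset gate, update gate and candidate, each fed only by a hidden state."""
    return 3 * hidden_dim * hidden_dim


def dt_params(input_dim, hidden_dim, num_t_gru):
    """One L-GRU at the bottom followed by num_t_gru T-GRUs."""
    return l_gru_params(input_dim, hidden_dim) + num_t_gru * t_gru_params(hidden_dim)


input_dim = 428  # hypothetical input size, chosen only for this example
wide_gru = gru_params(input_dim, 384)
narrow_dt = dt_params(input_dim, 128, num_t_gru=2)
```

Under these assumptions a narrow but deep transition block can have fewer weights than a single wide GRU, because most of the cost grows with the square of the hidden size.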

Related Work
Neural Sequence Labeling Collobert et al. (2011) propose a seminal neural architecture for sequence labeling, which learns useful representations from pre-trained word embeddings instead of hand-crafted features. The outstanding BiLSTMs-CRF architecture was subsequently developed, and has been improved by incorporating character-level LSTM (Lample et al., 2016), GRU (Yang et al., 2016), CNN (dos Santos and Zadrozny, 2014; Xin et al., 2018) and IntNet (Xin et al., 2018). The shallow connections between consecutive hidden states in those models inspire us to deepen the transition path for richer representations.
More recently, there has been a growing body of work exploring how to leverage language models trained on massive corpora at both the character level (Peters et al., 2017, 2018; Akbik et al., 2018) and the token level (Devlin et al., 2018). Inspired by the effectiveness of language model embeddings, we conduct auxiliary experiments by leveraging the well-known BERT as an additional feature.
Exploit Global Information Chieu and Ng (2002) explore the usage of global features in the whole document by the co-occurrence of each token, which is fed into a maximum entropy classifier. With the widespread application of distributed word representations and neural networks (Collobert et al., 2011) in sequence labeling tasks, the global information is encoded into the hidden states of BiRNNs. In particular, Yang et al. (2017a) leverage global sentence patterns for NER reranking. Inspired by the global sentence-level representation in S-LSTM (Zhang et al., 2018), we propose a more concise approach to capture global information, which has been demonstrated to be more effective on sequence labeling tasks.
Deep Transition RNN The deep transition recurrent architecture extends conventional RNNs by increasing the transition depth between consecutive hidden states. Previous studies have shown the superiority of this architecture on both language modeling (Pascanu et al., 2014) and machine translation (Barone et al., 2017; Meng and Zhang, 2019). We follow the deep transition architecture of Meng and Zhang (2019), and extend it into a hierarchical model with the global contextual representation at the sentence level for sequence labeling tasks.

Conclusion
We propose a novel hierarchical neural model for sequence labeling tasks (GCDT), which is based on the deep transition architecture and motivated by global contextual representation at the sentence level. Empirical studies on two standard datasets suggest that GCDT outperforms previous state-of-the-art systems substantially on both the CoNLL03 NER task and the CoNLL2000 Chunking task without additional corpora or task-specific resources. Furthermore, by leveraging BERT as an external resource, we report new state-of-the-art F1 scores of 93.47 on CoNLL03 and 97.30 on CoNLL2000.
In the future, we would like to extend GCDT to other analogous sequence labeling tasks and explore its effectiveness on other languages.