Embedded-State Latent Conditional Random Fields for Sequence Labeling

Complex textual information extraction tasks are often posed as sequence labeling or shallow parsing, where fields are extracted using local labels made consistent through probabilistic inference in a graphical model with constrained transitions. Recently, it has become common to locally parametrize these models using rich features extracted by recurrent neural networks (such as LSTMs), while enforcing consistent outputs through a simple linear-chain model, representing Markovian dependencies between successive labels. However, the simple graphical model structure belies the often complex non-local constraints between output labels. For example, many fields, such as a first name, can only occur a fixed number of times, or in the presence of other fields. While RNNs have provided increasingly powerful context-aware local features for sequence tagging, they have yet to be integrated with a global graphical model of similar expressivity in the output distribution. Our model goes beyond the linear-chain CRF to incorporate multiple hidden states per output label, but parametrizes them parsimoniously with low-rank log-potential scoring matrices, effectively learning an embedding space for hidden states. This augmented latent space of inference variables complements the rich feature representation of the RNN, and allows exact global inference obeying complex, learned non-local output constraints. We experiment with several datasets and show that the model outperforms baseline CRF+RNN models when global output constraints are necessary at inference-time, and explore the interpretable latent structure.


Introduction
As with many other prediction tasks involving complex structured outputs, such as image segmentation (Chen et al., 2018), machine translation (Bahdanau et al., 2015), and speech recognition (Hinton et al., 2012), deep neural networks (DNNs) for sequence labeling and shallow parsing have become standard tools for information extraction (Collobert et al., 2011; Lample et al., 2016). In the language of structured prediction, DNNs process the input sequence to produce a rich local parametrization for the output prediction model. However, output variables obey a variety of hard and soft constraints: for example, in sequence tagging tasks such as named entity recognition, I-PER cannot follow B-ORG.
Interestingly, even with such powerful local featurization, the DNN model does not automatically capture a mode of the output distribution through local decisions alone, and can violate these constraints. Successful applications of DNNs to sequence tagging gain from incorporating a simple linear-chain probabilistic graphical model to enforce consistent output predictions (Collobert et al., 2011; Lample et al., 2016), and more generally the addition of a graphical model to enforce output label consistency is common practice for other tasks such as image segmentation (Chen et al., 2018).
Previous work in DNN-featurized sequence tagging with graphical models for information extraction has limited its output structure modeling to these simple local Markovian dependencies. In this work, we explore the addition of latent variables to the prediction model, and through a parsimonious factorized parameter structure, perform representation learning of hidden state embeddings in the graphical model, complementary to the standard practice of representation learning in the local potentials of the segmentation model. By factorizing the log-potentials of the hidden state transition matrices, we are able to learn large numbers of hidden states without overfitting, while the latent dynamics add the capability to learn global constraints on the overall prediction, without sacrificing efficient exact inference.
While soft and hard global constraints have a rich history in sequence tagging (Koo et al., 2010; Rush and Collins, 2012; Anzaroot et al., 2014), they have been underexplored in the context of neural-network-based feature extraction models. In response, we present a latent-variable CRF model with a novel mechanism for learning latent constraints without overfitting, using low-rank embeddings of large-cardinality latent variables. Such non-local constraints appear, for example, in fine-grained nested field extraction, which requires hierarchical consistency between the subtags of an entity. Further, information extraction and slot filling tasks often require domain-specific constraints: for example, we must avoid extracting the same field multiple times. A good combination of input featurization and output modeling is needed to capture these structural dependencies.
In this work we present a method for sequence labeling in which representation learning is applied not only to inputs, but also to output space, in the form of a lightly parameterized transition function between a large number of latent states. We introduce a hidden state variable and learn the model dynamics in the hidden state space rather than the label state space. This relaxes the Markov assumption between output labels and allows the model to learn global constraints. To avoid the quadratic blowup in parameters with the size of the latent state space, we factorize the transition log-potentials into a low-rank matrix, avoiding overfitting by effectively learning parsimonious embedded representations of the latent states. While the low-rank log-potential matrix does not improve test-time inference speed, we can perform exact Viterbi inference to compute the labeling sequence. Figure 1 shows an example where our model finds the correct labeling sequence while a standard DNN+CRF model fails, by obeying a global constraint learned from the training data.
We examine the performance of the Embedded-State Latent CRF on two datasets: citation extraction on the UMass Citations dataset and medical record field extraction on the CLEF dataset. We observe improved performance in both tasks, whose outputs obey complex structural dependencies that are not able to be captured by RNN featurization. Our biggest improvement comes in the medical domain, where the small training set gives our parsimonious approach to output representation learning an extra advantage.

Problem Formulation
We consider the sequence labeling task, defined as follows. Given an input text sequence with T tokens x = {x_1, x_2, ..., x_T}, find a corresponding output sequence y = {y_1, y_2, ..., y_T} where each output symbol y_i is one of N possible output labels. There are structural dependencies between the output labels, and resolving these dependencies is necessary for good performance.

Background
The input featurization in our model is similar to the previously mentioned methods for tagging with DNNs (Collobert et al., 2011). We represent each input token x_t with a word embedding w_t. We then feed the embedded sequence w = {w_1, w_2, ..., w_T} into a bidirectional LSTM (Graves and Schmidhuber, 2005). As a result, each input x_t is associated with a contextualized feature vector f_t encoding the left and right context at time step t of the sequence.
In this work, we concern ourselves with the mapping from these input features to a distribution over output label sequences. A straightforward solution is to use a feedforward network to map each feature vector to the corresponding label. From a probabilistic perspective, this method is equivalent to the probabilistic graphical model in Fig. 2a. Here, the goal is to estimate the posterior distribution

P(y | x) = prod_{t=1}^{T} P(y_t | x_t),

where the joint distribution over the sequence is fully factorized, i.e. there is no structural dependency between the y_t, and the distribution P(y_t | x_t) is parameterized by a deep neural network, psi(y_t; x_t) = feedforward(f_t, y_t). This model ignores all structural dependencies between the output labels during prediction, though not featurization, and has been found unsuitable for structured prediction tasks on sequences (Collobert et al., 2011).
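As a minimal sketch of this fully factorized baseline (in NumPy rather than the PyTorch of actual implementations; the function and variable names here are ours, chosen for illustration), the posterior is just an independent softmax per token:

```python
import numpy as np

def token_posteriors(scores):
    """Fully factorized posterior: P(y | x) = prod_t P(y_t | x_t).

    scores: (T, N) local log-potentials psi(y; x_t) from the feedforward
    layer. Each row is normalized independently, so no interaction
    between neighboring labels is modeled.
    """
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)
```

Decoding by row-wise argmax of these distributions can therefore emit invalid sequences such as I-PER immediately after B-ORG.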
In order to enforce some local output consistency, Collobert et al. (2011) introduce a linear-chain Conditional Random Field (CRF) layer to the model (Fig. 2b). They define the energy function for a particular configuration as

E(x, y) = sum_t psi_xy(x_t, y_t) + sum_t psi_yy(y_t, y_{t+1}),

where the local log-potentials psi_xy are parameterized by a DNN, and (for their application) the edge log-potentials psi_yy are parameterized by an input-independent parameter matrix, modeling the dependencies between successive labels under a Markovian assumption, giving the data log-likelihood as

log P(y | x) = E(x, y) - log sum_{y'} exp E(x, y').

Collobert et al. (2011) show a +1.71 performance gain in Named-Entity Recognition (NER) by explicitly enforcing these local structural dependencies. However, the Markov assumption is limiting, and much of the gain comes from enforcing deterministic hard constraints of the segmentation encoding (e.g. I-ORG cannot follow B-PER). Similar types of local gains come from hierarchical tagging schemes (e.g. I-DATE should be tagged as I-VENUE/DATE if it appears inside an I-VENUE/* segment). We would like to model, and learn, global, semantically meaningful soft constraints, e.g. BOOKTITLE should become TITLE if another TITLE does not appear in the same citation (Anzaroot et al., 2014). The state transition dynamics of the linear-chain CRF model are limited by a restriction to pairwise interactions between the N output labels. The information-rich features f_t in R^d input to the local potential are reduced to a local preference over the N labels in output space, failing to exploit the full power of the underlying feature space.
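Concretely, scoring one labeling under this linear-chain energy is a short computation. The following NumPy sketch uses our own naming, not the authors' code:

```python
import numpy as np

def chain_energy(emissions, transitions, labels):
    """Energy of one labeling under a linear-chain CRF.

    emissions:   (T, N) local log-potentials psi_xy(x_t, y) from the DNN
    transitions: (N, N) input-independent edge log-potentials psi_yy
    labels:      length-T list of label indices
    """
    # Sum of local scores for the chosen label at each position ...
    local = sum(emissions[t, y] for t, y in enumerate(labels))
    # ... plus the transition score for each adjacent label pair.
    edges = sum(transitions[labels[t], labels[t + 1]]
                for t in range(len(labels) - 1))
    return float(local + edges)
```

Hard constraints such as "I-ORG cannot follow B-PER" are typically imposed by fixing the corresponding transition entry to a large negative value.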

Embedded-State Latent CRF
Our proposed model, the embedded-state latent CRF, is shown in Figure 2c. We introduce a sequence of hidden states z = {z_1, z_2, ..., z_T} where z_t is one of M possible discrete hidden states and M >> N. The corresponding energy for a particular joint configuration over y and z is

E(x, y, z) = sum_t [psi_xz(x_t, z_t) + psi_yz(y_t, z_t)] + sum_t psi_zz(z_t, z_{t+1}),

where psi_xz(x_t, z_t) and psi_yz(y_t, z_t) are the local interaction log-potentials between the input features and hidden states, and between the hidden states and output states, respectively. The hidden state dynamics come from the log-scores psi_zz(z_t, z_{t+1}) for transitioning between hidden states z_t and z_{t+1}. The posterior distribution over output labels can be computed by summing over all possible configurations of z:

P(y | x) = sum_z P(y, z | x) = (1/Z) sum_z exp E(x, y, z).

The local log-potentials psi_xz(x_t, z_t) are produced by an affine transform of the RNN feature extractor's output, and the output potentials psi_yz(y_t, z_t) encode a many-to-one mapping from hidden states to labels, with learned potentials but pre-allocated numbers of states for each output label.
Factorized transition log-potentials We empirically observe that introducing a large number of hidden states can lead to overfitting, due to overparameterization of the output dependencies. For example, JOURNAL often co-occurs with PAGES, but JOURNAL is not strictly accompanied by PAGES (Anzaroot et al., 2014). Therefore, we regularize the state transition log-potentials with a low-rank constraint, forming an embedding matrix wherein state transition interaction scores are mediated through low-dimensional state embeddings rather than a fully unconstrained parameter matrix. Instead of learning a full-rank hidden state transition potential A in R^{M x M}, we learn a low-rank model A = U^T V, where U, V in R^{k x M}. This reduces the number of parameters from M^2 to 2Mk (where k << M) and shares statistical strength when learning transitions between similar states.
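The factorization itself is a one-line matrix product. The NumPy sketch below (with illustrative, not paper-reported, sizes) makes the parameter saving concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
M, k = 200, 20                            # hidden states, embedding size (k << M)
U = rng.normal(scale=0.1, size=(k, M))    # "from"-state embeddings
V = rng.normal(scale=0.1, size=(k, M))    # "to"-state embeddings

# Transition log-potentials: the score for z_t = i -> z_{t+1} = j is the
# inner product of the i-th column of U with the j-th column of V.
A = U.T @ V                               # (M, M) matrix of rank at most k

full_rank_params = M * M                  # 40,000 free parameters
low_rank_params = 2 * M * k               # 8,000 free parameters
```

In training, U and V are the learned parameters; A is materialized only when computing transition scores.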
Inference. The brute-force computation of the posterior distribution by explicit summation over z is intractable, especially with a large number of hidden states. Fortunately, both the energy and the partition function can be computed efficiently using tree belief propagation. Due to the deterministic mapping from hidden states to outputs, we can simply fold the local input and output potentials psi_xz(x_t, z_t) and psi_yz(y_t, z_t) into the edge potentials and perform the forward-backward algorithm as in a standard linear-chain CRF. This deterministic mapping also lets us enforce hard transition constraints while retaining exact inference. Furthermore, since our implementation is in PyTorch (Paszke et al., 2017), we only need to implement the forward pass, as automatic differentiation (back-propagation) is equivalent to the backward pass (Eisner, 2016).
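With the label-dependent potentials folded into per-position scores over hidden states, the log partition function follows the standard forward recursion. This is a NumPy sketch under our own naming (in a PyTorch implementation, the backward pass would come from autodiff, as noted above):

```python
import numpy as np

def log_partition(folded, log_A):
    """Forward algorithm over the hidden-state chain.

    folded: (T, M) per-position log-potentials over hidden states, with
            psi_xz (and any summed psi_yz terms) already folded in
    log_A:  (M, M) hidden-state transition log-potentials
    Returns log Z, the log of the partition function.
    """
    alpha = folded[0]                                       # (M,)
    for t in range(1, len(folded)):
        # Sum over the previous state in log space, done stably.
        scores = alpha[:, None] + log_A + folded[t][None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return float(m + np.log(np.exp(alpha - m).sum()))
```

Hard transition constraints amount to large negative entries in log_A, which simply vanish from the sums.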
MAP inference. At test time, we run the Viterbi algorithm to search for the best configuration over z rather than over y. The mapping from a hidden state z_t to its output label y_t is deterministic given the output state embedding.
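A sketch of this decoding step (NumPy; the hidden-to-label map here is a plain array standing in for the learned many-to-one potential):

```python
import numpy as np

def viterbi_decode(folded, log_A, z_to_y):
    """Exact MAP decoding over hidden states, then mapping to labels.

    folded: (T, M) per-position log-potentials over hidden states
    log_A:  (M, M) hidden-state transition log-potentials
    z_to_y: length-M array mapping each hidden state to its output label
    """
    T, M = folded.shape
    delta = folded[0]
    backptr = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A       # (previous state, current state)
        backptr[t] = scores.argmax(axis=0)    # best predecessor per state
        delta = scores.max(axis=0) + folded[t]
    # Trace the best hidden-state path back from the final position.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    path.reverse()
    return [int(z_to_y[z]) for z in path]
```

This is the same recursion as the forward algorithm with max in place of logsumexp, so the decode stays exact even with many hidden states.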

Related Work
Much deep learning research concerns itself with learning to represent the structure of the input space in a way that is highly predictive of the output. In this work, while using state-of-the-art sequence tagging baselines for input representation learning, we concern ourselves with learning the global structure of the output space of label sequences, as well as fine-grained local distinctions in output space. While representation learning in the form of fine-grained, discrete, latent state transitions in the output space has been explored in this context (e.g. various latent-variable conditional random fields (Sutton et al., 2007) and latent structured support vector machines (Yu and Joachims, 2009)), we enable the use of many more hidden states without overfitting by factorizing the log-potential transition matrices and modeling the log-scores of latent state interactions as products of low-dimensional embeddings, effectively performing feature learning in output space.
A simple linear-chain CRF over the labels was used in early applications of deep learning to sequence tagging (Collobert et al., 2011), as well as in the most recent high-performing segmentation models for named entity recognition (Lample et al., 2016). Outside of NLP, in tasks such as computer vision, certain classes of fully-connected graphical models over the output pixels have been used for multi-dimensional smoothing (Adams et al., 2010; Krähenbühl and Koltun, 2011), borrowing techniques from the graphics literature.
However, none of these models performs representation learning in the output space, as in the case of our proposed embedded latent-state model. Srikumar and Manning (2014) propose a similar factorized representation of output labels and their transitions, but only apply this to pairwise transitions of output labels and not latent dynamics of the whole sequence, while we believe the biggest gains are to be found by marrying representation learning techniques with latent variable methods.
In the graphical models literature, the most similar work to ours is the Latent-Dynamic CRF, which proposes the same graphical model structure, without the deep input featurization or, more importantly, the learned embedded factorization of transition scores. Additionally, that work uses a deterministic mapping of equal numbers of hidden states to output labels, while we have a hard-constrained (hidden states to output variables are always many-to-one), but learned, potential, with different outputs preallocated different numbers of states based on corpus frequency.
Many graphical models have been proposed for natural language processing under hard and soft global constraints, e.g. (Koo et al., 2010;Anzaroot et al., 2014;Vilnis et al., 2015), many based on dual decomposition (Rush and Collins, 2012). However, the constraints are often fixed, and even when learned (Anzaroot et al., 2014;Vilnis et al., 2015), the learning is done simply on constraint weights generated from pre-made templates, the construction of which requires domain knowledge.
Finally, Structured Prediction Energy Networks (Belanger and McCallum, 2016; Belanger et al., 2017) have been used for NLP tasks such as semantic role labeling, but they perform approximate inference through gradient descent on a learned energy function over labelings, effectively a fully-connected graphical model, while our model sits more clearly within the framework of graphical models, permitting exact inference with only non-convex learning, common to all latent-variable models.

Experiments
We experiment on two datasets with a rich output label space, the UMass Citations dataset (Anzaroot and McCallum, 2013) and the CLEF eHealth dataset (Suominen et al., 2015). Both of the datasets have a hierarchical label space, enforced by hard transition constraints, making this a form of shallow parsing (Anzaroot et al., 2014), with additional soft constraints in the label space due to the interdependent nature of the fields being extracted.

UMass Citations
We experiment with citation field extraction on the UMass Citations dataset (Anzaroot and McCallum, 2013), a collection of 2476 richly labeled citation strings, each tagged in a hierarchical manner, across a set of 38 entities demarcating both coarse-grained labeled segments, such as title, date, authors and venue, as well as fine-grained inner segments where applicable. The data follows a train/dev/test split of 1454, 655 and 367 citations, with 231085 total tokens. For example, a person's last name could be tagged as AUTHORS/PERSON/PERSON-LAST or VENUE/EDITOR/PERSON/PERSON-LAST depending on whether the person is the author of the cited TITLE or an editor of the publication VENUE. Similarly, year could be tagged as either DATE/YEAR or VENUE/DATE/YEAR depending on whether it is the cited work's publication date or the publication date of the venue of the cited work.

CLEF eHealth
We perform our second set of sequence labeling experiments on the NICTA Synthetic Nursing Handover dataset (Suominen et al., 2015) for clinical information extraction, consisting of 101 documents totaling 32122 tokens.
It is a synthetic dataset of handover records, which contain patient profiles as written by a registered nurse (RN) working in the medical ward and delivering verbal handovers to another nurse at a shift change by the patient's bedside. A document is typically 100-300 words long, and the included handover information contains five coarse entities, i.e., PATIENTINTRODUCTION, MYSHIFT, APPOINTMENTS, MEDICATION and FUTURECARE. Similar to the setup of the citation field extraction task described in Section 4.1.1, each of these coarse categories has a further level of nested finer labels, and the entities to be identified are all hierarchical in nature. For example, the PATIENTINTRODUCTION section contains entities such as PATIENTINTRODUCTION/LASTNAME and PATIENTINTRODUCTION/UNDERDR LASTNAME, the APPOINTMENTS section contains APPOINTMENT/PROCEDURE CLINICIANLASTNAME, and MEDICATION contains MEDICATION/DOSAGE and MEDICATION/MEDICINE. There are a total of 35 such fine-grained entities. In addition to the hard-constrained hierarchical structure of the labels, the task also exhibits interesting global constraints, such as only tagging the first occurrence of the patient's gender, or the convention of labeling the briefest description of a nurse's shift status as MYSHIFT/STATUS, while the details of the shift are tagged as MYSHIFT/OTHER. In such cases, information extraction benefits from modeling output label dependencies, as we show in the results section.

Training Details
Our baseline is the BiLSTM+CRF model from Lample et al. (2016), employing a bidirectional LSTM with 500 hidden units for input featurization to capture long-range dependencies in the input space. Since we do not focus on input featurization, we do not use character-level embeddings in the baseline model. Both the baseline model and our EL-CRF model were implemented in PyTorch. For training our models, we use the hyper-parameter settings from the LSTM+CRF model of Lample et al. (2016). Although we explored different techniques to enhance SGD, such as Adam (Kingma and Ba, 2015), Averaged SGD (Polyak and Juditsky, 1992) and YellowFin (Zhang et al., 2017), none of them performed as well as mini-batch SGD with a batch size of 1. We also employed gradient clipping to a norm of 5.0, a learning rate of 0.01, learning rate decay of 0.05, dropout with p = 0.5, and early stopping, tuned on the citation development data. We initialized our word-level embeddings using pre-trained 100-dimensional GloVe embeddings (Pennington et al., 2014), which gave better performance on our tasks than the skip-n-gram embeddings (Ling et al., 2015) used in the original work of Lample et al. (2016). The datasets were pre-processed to replace all occurrences of numbers with zeros. Finally, we experimented with both IOBES and IOB tagging schemes, with IOB demonstrating higher performance on our tasks.
Embedding size We tune the embedding size (rank constraint) k for the hidden state transition matrix A, varying it from 10 to 40, alongside the neural network parameters, and report results when fixing the other hyperparameters and varying only the embedding size, similar to an ablation analysis. Table 4 shows the impact of different embedding sizes on the performance of the model. We found that a size of 20 works best for both datasets, confirming the importance of the rank-constrained log-potential when using large-cardinality hidden variables.
Mapping tags to hidden states We find that the mapping between tags and hidden states greatly influences the performance of the model. We experimented with several heuristics (e.g., individual IOB tag count ratio and entity count ratio), and found that allocating a number of hidden states proportional to the entity count gives us the best performance.
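A simple version of this proportional allocation heuristic can be sketched as follows (plain Python; the label names and counts below are invented for illustration):

```python
def allocate_states(entity_counts, num_hidden):
    """Allocate hidden states to labels in proportion to corpus frequency.

    entity_counts: dict mapping each label to its training-set entity count
    num_hidden:    total number of hidden states M to distribute
    Every label receives at least one state; rounding leftovers are settled
    starting with the most frequent labels.
    """
    total = sum(entity_counts.values())
    alloc = {lab: max(1, round(num_hidden * c / total))
             for lab, c in entity_counts.items()}
    labs = sorted(alloc, key=entity_counts.get, reverse=True)
    i = 0
    while sum(alloc.values()) != num_hidden:
        lab = labs[i % len(labs)]
        if sum(alloc.values()) < num_hidden:
            alloc[lab] += 1
        elif alloc[lab] > 1:
            alloc[lab] -= 1
        i += 1
    return alloc
```

For example, with counts {"title": 50, "author": 30, "date": 20} and 10 hidden states, the allocation is 5, 3 and 2 states respectively.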

Evaluation
We report field-level F1 scores as computed using the conlleval.pl script.
Since the train/validation/test splits were clearly defined for the UMass Citations dataset, we trained the models on the training split, tuned the hyperparameters on the validation split, and report the scores on the test split. However, as there were only 101 documents in the CLEF eHealth dataset, we report Leave-One-Out (LOO) cross-validation F1 scores for this dataset, i.e., we trained 101 models, each with a different held-out document, merged the respective test outputs, and computed the F1 score on this merged output.

Table 1 shows that overall performance on the UMass Citations dataset using the embedded-state latent CRF (95.18) is marginally better than the baseline BiLSTM+CRF model (95.07). However, examining the entities with the largest F1 score improvement in Table 2, we see that they are mostly within the VENUE section, which has long-range constraints with other sections, giving evidence of the model's ability to learn constraints from the citation dataset.

Qualitative Analysis
In this section, we provide qualitative evidence that the embedded-state latent CRF learns constraints which are not captured by the standard CRF. First, we pick a few representative examples from the UMass Citations dataset and discuss when our model is able to correctly determine the label sequence based on the output constraints. In addition to the hard constraints arising from hierarchical segmentation, this dataset also exhibits empirical pairwise constraints between fields, e.g. two different authors' first names cannot be placed next to each other. Figure 3 demonstrates that the CRF model fails to enforce such constraints. Sometimes, however, output structural dependencies are not able to resolve ambiguity in the labeling sequence.
In Figure 5 our model correctly predicts the presence of a Venue/Booktitle and a Venue/Series, but it fails to correctly assign the entity labels. The CLEF eHealth dataset holds a different set of constraints than the citation data, and its input sequences are not strong local indicators of the labeling sequence. Therefore, our model shows stronger performance over the Markovian baseline on this dataset. Some of the constraints concern the number of entities per document. For example, we only tag the first occurrence of a gender indicator, e.g. he, she, her, etc., or the most general status of a nurse's shift. Finally, a t-SNE (Maaten and Hinton, 2008) visualization of the embedding vectors of the output tags, shown in Figure 7, demonstrates that output structural dependencies can be reflected in the tag embedding space.

Conclusion & Future Work
We present a latent variable model which not only parametrizes local potentials with the learned features from a deep neural network, but learns embedded representations in a large hidden state space, leveraging feature learning in both the input and output representations. Experimental results demonstrate that our model can learn global structural dependencies in the presence of ambiguities that cannot be resolved by local featurization of the input sequence. We find interpretable structure in the output state embeddings.
Future work will apply our model to larger datasets with more complex dependencies, and introduce multiple latent states per time-step, enabling exponentially more expressivity in output states at the cost of exact inference. We will also explore approximate inference methods, such as expectation propagation, to speed up message passing in the regime of low-rank log-potentials.