Semi-supervised Structured Prediction with Neural CRF Autoencoder

In this paper we propose an end-to-end neural CRF autoencoder (NCRF-AE) model for semi-supervised learning of sequential structured prediction problems. Our NCRF-AE consists of two parts: an encoder, which is a CRF model enhanced by deep neural networks, and a decoder, which is a generative model trying to reconstruct the input. Our model has a unified structure with different loss functions for labeled and unlabeled data with shared parameters. We developed a variation of the EM algorithm for optimizing both the encoder and the decoder simultaneously by decoupling their parameters. Our experimental results on the Part-of-Speech (POS) tagging task over eight different languages show that our model can outperform competitive systems in both supervised and semi-supervised scenarios.


Introduction
The recent renaissance of deep learning has led to significant strides forward in several AI fields. In Natural Language Processing (NLP), characterized by highly structured tasks, promising results were obtained by models that combine deep learning methods with traditional structured learning algorithms (Chen and Manning, 2014; Durrett and Klein, 2015; Andor et al., 2016; Wiseman and Rush, 2016). These models combine the strengths of neural models, which can score local decisions using a rich non-linear representation, with efficient inference procedures used to combine the local decisions into a coherent global decision. Among these models, neural variants of the Conditional Random Fields (CRF) model (Lafferty et al., 2001) are especially popular. By replacing the linear potentials with non-linear potentials computed by neural networks, these models were able to improve performance in several structured prediction tasks (Andor et al., 2016; Peng and Dredze, 2016; Lample et al., 2016; Ma and Hovy, 2016; Durrett and Klein, 2015).
Despite their promise, wider adoption of these algorithms for new structured prediction tasks can be difficult. Neural networks are notoriously susceptible to over-fitting unless large amounts of training data are available. This problem is exacerbated in the structured settings, as accounting for the dependencies between decisions requires even more data. Providing it through manual annotation is often a difficult labor-intensive task.
In this paper we tackle this problem, and propose an end-to-end neural CRF autoencoder (NCRF-AE) model for semi-supervised learning on sequence labeling problems.
An autoencoder is a special type of neural network modeling the conditional probability P(X̂|X), where X is the original input to the model and X̂ is the reconstructed input (Hinton and Zemel, 1994). Autoencoders consist of two parts: an encoder projecting the input to a hidden space, and a decoder reconstructing the input from it.
Traditionally, autoencoders are used for generating a compressed representation of the input by projecting it into a dense low-dimensional space. In our setting, the hidden space consists of discrete variables that comprise the output structure. These generalized settings are described in Figure 1a. By definition, it is easy to see that the encoder (the lower half in Figure 1a) can be modeled by a discriminative model describing P(Y|X) directly, while the decoder (the upper half in Figure 1a) naturally fits as a generative model describing P(X̂|Y), where Y is the label. In our model, illustrated in Figure 1b, the encoder is a CRF model with neural networks as its potential extractors, while the decoder is a generative model trying to reconstruct the input.
Our model carries the merit of autoencoders, which can exploit valuable information from unlabeled data. Recent works (Ammar et al., 2014; Lin et al., 2015) suggested using an autoencoder with a CRF model as its encoder in an unsupervised setting. We significantly expand on these works and make the following contributions:
1. We propose a unified model seamlessly accommodating both unlabeled and labeled data. While past work focused on unsupervised structured prediction, neglecting the discriminative power of such models, our model easily supports learning in both fully supervised and semi-supervised settings. We developed a variation of the Expectation-Maximization (EM) algorithm, used for optimizing the encoder and the decoder of our model simultaneously.
2. We increase the expressivity of the traditional CRF autoencoder model using neural networks as the potential extractors, thus avoiding the heavy feature engineering necessary in previous works. Interestingly, our model's predictions, which unify the discriminative neural CRF encoder and the generative decoder, have led to an improved performance over the highly optimized neural CRF (NCRF) model alone, even when trained in the supervised settings over the same data.
3. We demonstrate the advantages of our model empirically, focusing on the well-known Part-of-Speech (POS) tagging problem over 8 different languages, including low-resource languages. In the supervised setting, our NCRF-AE outperformed the highly optimized NCRF. In the semi-supervised setting, our model was able to successfully utilize unlabeled data, improving on the performance obtained when only using the labeled data, and outperforming competing semi-supervised learning algorithms.
Furthermore, our newly proposed algorithm is directly applicable to other sequential learning tasks in NLP, and can be easily adapted to other structured tasks such as dependency parsing or constituent parsing by replacing the forward-backward algorithm with the inside-outside algorithm. All of these tasks can benefit from semi-supervised learning algorithms.
In contrast to supervised latent-variable models, such as the Hidden Conditional Random Fields of Quattoni et al. (2007), which introduce additional latent variables for supervised structured prediction, our NCRF-AE does not assume any additional latent variables in either the supervised or the semi-supervised setting.
The difficulty of providing sufficient supervision has motivated work on semi-supervised and unsupervised learning for many of these tasks (McClosky et al., 2006;Spitkovsky et al., 2010;Subramanya et al., 2010;Stratos and Collins, 2015;Marinho et al., 2016;Tran et al., 2016), including several that also used autoencoders (Ammar et al., 2014;Lin et al., 2015;Miao and Blunsom, 2016;Kociský et al., 2016;Cheng et al., 2017). In this paper we expand on these works, and suggest a neural CRF autoencoder, that can leverage both labeled and unlabeled data.

Neural CRF Autoencoder
In semi-supervised learning the algorithm needs to utilize both labeled and unlabeled data. Autoencoders offer a convenient way of dealing with both types of data in a unified fashion.
A generalized autoencoder (Figure 1a) tries to reconstruct the input X̂ given the original input X, aiming to maximize the log probability log P(X̂|X) without knowing the latent variable Y explicitly. Since we focus on sequential structured prediction problems, the encoding and decoding processes are no longer over a single data point, but over the whole input instance and output sequence (x, y) (x alone if unlabeled). Additionally, as our main purpose in this study is to reconstruct the input with precision, x̂ is just a copy of x.
Figure 1: On the left is a generalized autoencoder, of which the lower half is the encoder and the upper half is the decoder. On the right is an illustration of the graphical model of our NCRF-AE, the neural CRF autoencoder model in this work. The yellow squares are interactive potentials among labels, and the green squares represent the unary potentials generated by the neural networks.
As shown in Figure 1b, our NCRF-AE model consists of two parts: the encoder (the lower half) is a discriminative CRF model, enhanced by deep neural networks as its potential extractors, with encoding parameters Λ, describing the probability of a predicted sequence of labels given the input; the decoder (the upper half) is a generative model with reconstruction parameters Θ, modeling the probability of reconstructing the input given a sequence of labels. Accordingly, we present our model mathematically as follows:

P_{Λ,Θ}(x̂, y | x) = P_Λ(y | x) · P_Θ(x̂ | y),

where P_Λ(y|x) is the probability given by the neural CRF encoder, and P_Θ(x̂|y) is the probability produced by the generative decoder.
When making a prediction, the model tries to find the most probable output sequence by performing the following inference procedure using the Viterbi algorithm:

y* = argmax_y P_Λ(y | x) · P_Θ(x̂ | y).

To clarify, as we focus on POS tagging problems in this study, in the unsupervised setting where the true POS tags are unknown, the labels used for reconstruction are actually the POS tags being induced. The labels induced here correspond to the hidden nodes in a generalized autoencoder model.
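This inference folds the decoder's reconstruction scores into the encoder's unary scores and then runs standard Viterbi decoding. The following is a minimal NumPy sketch under our own naming; the paper provides no reference implementation, so array layouts and function names here are illustrative assumptions:

```python
import numpy as np

def viterbi_decode(unary, trans, log_recon):
    """Most probable tag sequence under the joint encoder-decoder score.

    unary:     (T, K) scores phi(x, y_t) from the neural CRF encoder
    trans:     (K, K) transition scores psi(y_{t-1}, y_t)
    log_recon: (T, K) log P(x_t | y_t) from the generative decoder
    """
    scores = unary + log_recon            # the decoder folds into the unary term
    T, K = scores.shape
    delta = scores[0].copy()              # best score of a path ending in each tag
    back = np.zeros((T, K), dtype=int)    # back-pointers
    for t in range(1, T):
        cand = delta[:, None] + trans + scores[t][None, :]   # (prev tag, cur tag)
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0)
    path = [int(delta.argmax())]          # best final tag, then follow pointers
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Because all scores are combined in log space, the decoder's term simply shifts the per-position scores before the usual max-product recursion.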

Neural CRF Encoder
In a CRF model, the probability of the predicted labels y, given a sequence x as input, is modeled as

P_Λ(y | x) = exp(Φ(x, y)) / Z(x),

where Z(x) = Σ_{y′} exp(Φ(x, y′)) is the partition function that marginalizes over all possible assignments to the predicted labels of the sequence, and Φ(x, y) is the scoring function, which is defined as:

Φ(x, y) = Σ_t [ φ(x, y_t) + ψ(y_{t−1}, y_t) ].

The partition function Z can be computed efficiently via the forward-backward algorithm. The term φ(x, y_t) corresponds to the score of a particular tag y_t at position t in the sequence, and ψ(y_{t−1}, y_t) represents the score of the transition from the tag at position t − 1 to the tag at position t.
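The forward recursion computes log Z(x) in O(T · K²) time instead of summing over all K^T label sequences. A small illustrative sketch (our own code, not the authors' implementation):

```python
import numpy as np

def logsumexp(a, axis):
    # numerically stable log-sum-exp along one axis
    m = a.max(axis=axis)
    return m + np.log(np.exp(a - np.expand_dims(m, axis)).sum(axis=axis))

def log_partition(unary, trans):
    """log Z(x) via the forward recursion.

    unary: (T, K) scores phi(x, y_t); trans: (K, K) scores psi(y_{t-1}, y_t).
    """
    alpha = unary[0].copy()                       # forward messages in log space
    for t in range(1, len(unary)):
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + unary[t]
    return logsumexp(alpha, axis=0)
```

The same recursion, with a different per-position score, reappears later when computing the term U for unlabeled sequences.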
In our NCRF-AE model, φ(x, y_t) is described by deep neural networks, while ψ(y_{t−1}, y_t) is given by a transition matrix. Such a structure allows for the use of distributed representations of the input, for instance word embeddings in a continuous vector space (Mikolov et al., 2013).
Typically in our work, φ(x, y_t) is modeled jointly by a multi-layer perceptron (MLP) that utilizes the word-level information, and a bi-directional long short-term memory (LSTM) neural network (Hochreiter and Schmidhuber, 1997) that captures the character-level information within each word. A bi-directional structure can extract character-level information from both directions, with which we expect to capture the prefix and suffix information of words in an end-to-end system, rather than using hand-engineered features. The bi-directional LSTM neural network consumes character embeddings e_c ∈ R^{k_1} as input, where k_1 is the dimensionality of the character embeddings. A standard LSTM can be denoted as:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ⊙ tanh(c_t),

where ⊙ denotes element-wise multiplication. Denoting the procedure of generating h_t as H, a bi-directional LSTM neural network extends it as follows:

→h_t = H(e_{c_t}, →h_{t−1}),    ←h_t = H(e_{c_t}, ←h_{t+1}),

where e_{c_t} is the character embedding for the character at position t in a word.
The inputs to the MLP are word embeddings e_v ∈ R^{k_2} for each word v, where k_2 is the dimensionality of the vector, concatenated with the final representation generated by the bi-directional LSTM over the characters of that word. In order to leverage the capacity of the CRF model, we use a word and its context together to generate the unary potential. More specifically, we adopt the concatenation v_t = [u_{t−(w−1)/2}; · · · ; u_{t−1}; u_t; u_{t+1}; · · · ; u_{t+(w−1)/2}] as the input to the MLP model, where t denotes the position in a sequence, and w, an odd number, indicates the context size. Further, in order to enhance the generality of our model, we add a dropout layer on the input right before the MLP layer as a regularizer. Note that, unlike a normal MLP, the activation function of the last layer is no longer a softmax function, but a linear function that generates the log-linear part φ(x, y_t) of the CRF model. The transition score ψ(y_{t−1}, y_t) is a single scalar representing the interactive potential. We use a transition matrix Ψ to cover all the transitions between different labels, and Ψ is part of the encoder parameters Λ.
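The unary-potential construction above, a context window of concatenated word representations fed to an MLP with a linear output layer, can be sketched as follows. The zero-padding at sentence boundaries, the tanh activation, and the omission of dropout are our own simplifying assumptions, not specified by the text:

```python
import numpy as np

def unary_potentials(word_reprs, W1, b1, W2, b2, w=3):
    """Context-window MLP producing phi(x, y_t) for every position t.

    word_reprs: (T, d) rows u_t = [word embedding; l_t; r_t] as in the text.
    The window v_t = [u_{t-1}; u_t; u_{t+1}] (for w = 3) feeds a one-hidden-layer
    MLP whose last layer is linear, not softmax: the CRF normalizes globally.
    """
    T, d = word_reprs.shape
    pad = np.zeros(((w - 1) // 2, d))            # zero-pad sentence boundaries
    padded = np.vstack([pad, word_reprs, pad])
    windows = np.stack([padded[t:t + w].ravel() for t in range(T)])  # (T, w*d)
    hidden = np.tanh(windows @ W1 + b1)          # hidden layer (dropout omitted)
    return hidden @ W2 + b2                      # (T, K) linear scores, one per tag
```

The linear output is essential: a softmax here would normalize each position locally, whereas the CRF's partition function normalizes over whole sequences.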
All the parameters in the neuralized encoder are updated when the loss function is minimized via error back-propagation through all the structures of the neural networks and the transition matrix.
The detailed structure of the neural CRF encoder is shown in Figure 2. Note that the MLP layer is interchangeable with a recurrent neural network (RNN) or LSTM layer, but in our pilot experiments we found that a single MLP structure yields better performance, which we conjecture is due to over-fitting caused by the higher complexity of those alternatives.

Figure 2: A demonstration of the neural CRF encoder. l_t and r_t are the outputs of the forward and backward character-level LSTM of the word at position t in a sentence, and e_t is the word-level embedding of that word. u_t is the concatenation of e_t, l_t and r_t, denoted by blue dashed arrows.

Generative Decoder
In our NCRF-AE, we assume the generative process follows several multinomial distributions: each label y has a probability θ_{y→x} of reconstructing the corresponding word x, i.e., P(x|y) = θ_{y→x}. This setting naturally leads to the constraint Σ_x θ_{y→x} = 1. The number of parameters of the decoder is |Y| × |X|. For a whole sequence, the reconstruction probability is P_Θ(x|y) = Π_t P(x_t|y_t).
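This multinomial decoder amounts to a |Y| × |X| parameter table θ whose rows are distributions over the vocabulary. A minimal sketch of the per-sequence reconstruction log-probability (variable names are ours):

```python
import numpy as np

def decoder_log_prob(theta, word_ids, tag_ids):
    """log P_Theta(x | y) = sum_t log theta[y_t, x_t] for one sequence.

    theta is a (|Y|, |X|) table of multinomials: row y is the distribution
    P(. | y) over the vocabulary, so each row must sum to 1.
    """
    theta = np.asarray(theta)
    assert np.allclose(theta.sum(axis=1), 1.0), "each row must be a distribution"
    return float(np.log(theta[tag_ids, word_ids]).sum())
```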

A Unified Learning Framework
We first constructed two loss functions, for labeled and for unlabeled data, using the same model. Our model is trained in an online fashion: given a labeled or unlabeled sentence, our NCRF-AE optimizes the corresponding loss function. In an analogy to coordinate descent, we optimize the loss function of the NCRF-AE by alternately updating the parameters Θ in the decoder and the parameters Λ in the encoder. The parameters Θ in the decoder are updated via a variation of the Expectation-Maximization (EM) algorithm, and the parameters Λ in the encoder are updated through a gradient-based method due to the non-convexity of the neuralized CRF. In contrast to the early autoencoder models (Ammar et al., 2014; Lin et al., 2015), our model has two distinctions: first, we have two loss functions to model labeled and unlabeled examples; second, we designed a variant of the EM algorithm to alternately learn the parameters of the encoder and the decoder at the same time.

Unified Loss Functions for Labeled and Unlabeled Data
For a sequential input with labels, the complete data likelihood given by our NCRF-AE is

P_{Λ,Θ}(x̂, y | x) = exp( Σ_t s_t(x, y) ) / Z(x),

where s_t(x, y) = log P(x_t|y_t) + φ(x, y_t) + ψ(y_{t−1}, y_t).
If the input sequence is unlabeled, we can simply marginalize over all the possible assignments to the labels. The probability is formulated as

P_{Λ,Θ}(x̂ | x) = Σ_y exp( Σ_t s_t(x, y) ) / Z(x) = U(x) / Z(x),

where U(x) = Σ_y exp( Σ_t s_t(x, y) ).
Our formulation has two advantages. First, the term U is different from, but has a similar form to, the term Z, so that to calculate the probability P(x̂|x) for an unlabeled sequence, the forward-backward algorithm used to compute the partition function Z can also be applied to compute U efficiently. Second, our NCRF-AE highlights a unified structure of different loss functions for labeled and unlabeled data with shared parameters. Thus during training, our model can address both labeled and unlabeled data well by alternating the loss functions. Using the negative log-likelihood as our loss function, if the data is labeled, the loss function is:

l(x, y) = −log P_{Λ,Θ}(x̂, y | x) = −( Σ_t s_t(x, y) − log Z(x) ).

If the data is unlabeled, the loss function is:

u(x) = −log P_{Λ,Θ}(x̂ | x) = −( log U(x) − log Z(x) ).

Thus, during training, based on whether the encountered data is labeled or unlabeled, our model can select the appropriate loss function for learning parameters. In practice, we found that for labeled data, using a combination of loss l and loss u actually yields better performance.
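Both losses share one forward recursion: Z uses the encoder scores alone, while U folds log P(x_t|y_t) into the per-position scores. The loss selection can be sketched as follows (illustrative code under our own naming, not the authors' implementation):

```python
import numpy as np

def _logsumexp(a, axis):
    m = a.max(axis=axis)
    return m + np.log(np.exp(a - np.expand_dims(m, axis)).sum(axis=axis))

def _forward(scores, trans):
    """log of the sum over all label paths of exp(path score)."""
    alpha = scores[0].copy()
    for t in range(1, len(scores)):
        alpha = _logsumexp(alpha[:, None] + trans, axis=0) + scores[t]
    return _logsumexp(alpha, axis=0)

def ncrf_ae_loss(unary, trans, log_recon, tags=None):
    """Negative log-likelihood, selected by whether gold tags are given.

    Labeled:   -(sum_t s_t(x, y) - log Z);  unlabeled: -(log U - log Z),
    where s_t = log P(x_t | y_t) + phi(x, y_t) + psi(y_{t-1}, y_t). U is
    computed by the same forward recursion as Z, with the reconstruction
    term folded into the per-position scores.
    """
    log_Z = _forward(unary, trans)                 # encoder scores only
    joint = unary + log_recon                      # reconstruction folded in
    if tags is None:
        return -(_forward(joint, trans) - log_Z)   # unlabeled loss
    score = joint[np.arange(len(tags)), tags].sum() + trans[tags[:-1], tags[1:]].sum()
    return -(score - log_Z)                        # labeled loss
```

A quick sanity check: with a zero reconstruction term, U equals Z, so the unlabeled loss vanishes.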

Mixed Expectation-Maximization Algorithm
The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) has been applied to a wide range of problems. Generally, it establishes a lower bound of the objective function using Jensen's inequality. It first finds the posterior distribution of the latent variables, and then, based on that posterior distribution, it maximizes the lower bound. By alternating expectation (E) and maximization (M) steps, the algorithm iteratively improves the lower bound of the objective function.
In this section we describe the mixed Expectation-Maximization (EM) algorithm used in our study. Parameterized by the encoding parameters Λ and the reconstruction parameters Θ, our NCRF-AE consists of the encoder and the decoder, which together form a log-likelihood that is a highly non-convex function. However, a careful observation shows that if we fix the encoder, the lower bound derived in the E step is convex with respect to the reconstruction parameters Θ in the M step. Hence, in the M step we can analytically obtain the global optimum of Θ. For the reconstruction parameters Θ, fixing Λ, we describe our EM algorithm in iteration t as follows. In the E-step, we let Q(y_i) = P(y_i | x_i, x̂_i), treating y_i as the latent variable since it is not observable in unlabeled data. We derive the lower bound of the log-likelihood using Q(y_i):

log P(x̂_i | x_i) = log Σ_{y_i} P(x̂_i, y_i | x_i) ≥ Σ_{y_i} Q(y_i) log [ P(x̂_i, y_i | x_i) / Q(y_i) ],

where Q(y_i) is computed using the parameters Θ^(t−1) from the previous iteration t − 1.
In the M-step, we try to improve the aforementioned lower bound using all examples:

Θ^(t) = argmax_Θ Σ_i Σ_{y_i} Q(y_i) log P_Θ(x̂_i | y_i) + const,    s.t. Σ_x θ_{y→x} = 1 for every label y.
In this formulation, const is a constant with respect to the parameters we are updating. Q(y) is the distribution of a label y at any position, obtained by marginalizing the labels at all other positions in a sequence. Denoting by C(y, x) the number of times that (x, y) co-occurs, E_{y∼Q_{Θ^(t−1)}}[C(y, x)] is the expected count of a particular reconstruction at any position, which can be calculated using the Baum-Welch algorithm (Welch, 2003) and summed over all examples in the dataset (for labeled data, it is just a real count). The procedure we used to calculate the expected counts is described in Algorithm 1. It can then be shown that the aforementioned global optimum is obtained by simply normalizing the expected counts. The encoder's parameters Λ are updated via gradient-based optimization before each EM iteration. Based on the above discussion, our mixed EM algorithm is presented in Algorithm 2.
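The closed-form M-step, normalizing each row of the expected-count table, can be sketched in a few lines (the optional additive smoothing is our own addition, not part of the paper):

```python
import numpy as np

def m_step(expected_counts, smoothing=0.0):
    """Closed-form M-step: theta_{y -> x} = E[C(y, x)] / sum_x' E[C(y, x')].

    expected_counts: (|Y|, |X|) table T_e of possibly fractional counts;
    real counts from labeled data and expected counts from unlabeled data
    are summed into the same table before normalizing.
    """
    counts = np.asarray(expected_counts, dtype=float) + smoothing
    return counts / counts.sum(axis=1, keepdims=True)
```

Row-wise normalization directly enforces the constraint Σ_x θ_{y→x} = 1 for every label y.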

Algorithm 1 Obtain Expected Count (T_e)
Require: the expected count table T_e
1: for an unlabeled data example x_i do
2:    Compute the forward messages α(y, t) ∀y, t, where t is the position in a sequence.
3:    Compute the backward messages β(y, t) ∀y, t.
4:    Calculate the expected count for each x in x_i: P(y_t | x_t) ∝ α(y, t) × β(y, t).
5:    T_e(x_t, y_t) ← T_e(x_t, y_t) + P(y_t | x_t)    ▷ T_e is the expected count table
6: end for

Algorithm 2 Mixed EM Algorithm
1: Initialize the decoder parameters Θ^(0) with real counts from the labeled data.
2: for EM iteration t = 1, 2, . . . do
3:    Train the encoder on labeled data {x, y}_l and unlabeled data {x}_u to update Λ^(t−1) to Λ^(t).
4:    Re-initialize the expected count table T_e with 0s.
5:    Use the labeled data {x, y}_l to calculate real counts and update T_e.
6:    Use the unlabeled data {x}_u to compute the expected counts with parameters Λ^(t) and Θ^(t−1) and update T_e (Algorithm 1).
7:    Obtain Θ^(t) globally and analytically based on T_e.
8: end for

This mixed EM algorithm is a combination of a gradient-based approach, which optimizes the encoder by minimizing the negative log-likelihood as the loss function, and an EM approach, which updates the decoder's parameters by improving the lower bound of the log-likelihood.
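Algorithm 1's accumulation of expected counts via forward-backward marginals can be sketched as follows. This is an illustrative NumPy version under our own naming; the per-position posterior P(y_t | x) is obtained by normalizing α × β at each position:

```python
import numpy as np

def _logsumexp(a, axis):
    m = a.max(axis=axis)
    return m + np.log(np.exp(a - np.expand_dims(m, axis)).sum(axis=axis))

def accumulate_expected_counts(T_e, word_ids, scores, trans):
    """Add the posterior tag marginals of one unlabeled sequence into T_e.

    scores: (T, K) per-position scores; trans: (K, K) transition scores.
    Forward messages alpha and backward messages beta are combined into
    P(y_t | x) at each position and added at the observed word's column.
    """
    T, K = scores.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))                      # beta[T-1] = 0 in log space
    alpha[0] = scores[0]
    for t in range(1, T):                        # forward pass
        alpha[t] = _logsumexp(alpha[t - 1][:, None] + trans, axis=0) + scores[t]
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = _logsumexp(trans + (scores[t + 1] + beta[t + 1])[None, :], axis=1)
    for t in range(T):                           # normalized posteriors
        post = alpha[t] + beta[t]
        post = np.exp(post - _logsumexp(post, axis=0))
        T_e[:, word_ids[t]] += post
    return T_e
```

Each position contributes a distribution summing to one, so a sequence of length T adds exactly T units of (fractional) count mass to the table.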

Experimental Settings
Dataset We evaluated our model on the POS tagging task, in both the supervised and semi-supervised learning settings, over eight different languages from the UD (Universal Dependencies) 1.4 dataset (McDonald et al., 2013). The task is defined over 17 different POS tags, used across the different languages.

             English  French  German  Italian  Russian  Spanish  Indonesian  Croatian
Tokens        254830  391107  293088   272913    99389   423346      121923    139023
Training       12543   14554   14118    12837     4029    14187        4477      5792
Development     2002    1596     799      489      502     1552         559       200
Testing         2077     298     977      489      499      274         297       297

Table 1: Statistics of the different UD languages used in our experiments, including the number of tokens, and the number of sentences in the training, development and testing sets respectively.

We followed the original
UD division for training, development and testing in our experiments. The statistics of the data used in our experiments are described in Table 1. The UD dataset includes several low-resource languages, which are of particular interest to our semi-supervised model.

Input Representation and Neural Architecture
Our model uses word embeddings as input. In our pilot experiments, we compared, on the English dataset, the performance of the pre-trained embeddings from Google News (Mikolov et al., 2013) and of embeddings we trained directly on the UD dataset using the skip-gram algorithm (Mikolov et al., 2013). We found that these two types of embeddings yield very similar performance on the POS tagging task, so in our experiments we used embeddings for the different languages trained directly on the UD dataset, with dimension 200, as input to our model. For the MLP neural network layer, the number of hidden nodes in the hidden layer is 20, which is the same for the hidden layer in the character-level LSTM. The dimension of the character-level embeddings fed into the LSTM layer is 15; these are randomly initialized. In order to incorporate the global information of the input sequence, we set the context window size to 3. The dropout rate for the dropout layer is set to 0.5.
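For reference, the hyper-parameters stated in this section can be collected in one place (the variable names are our own; the values are the ones reported above):

```python
# Hyper-parameters as stated in this section, gathered in one place.
HYPERPARAMS = {
    "word_embedding_dim": 200,    # skip-gram embeddings trained on UD
    "char_embedding_dim": 15,     # randomly initialized
    "mlp_hidden_units": 20,
    "char_lstm_hidden_units": 20,
    "context_window": 3,          # w must be odd (a centered window)
    "dropout_rate": 0.5,          # applied to the MLP input layer
}
```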
Learning We used ADADELTA (Zeiler, 2012) to update the parameters Λ in the encoder, as ADADELTA dynamically adapts the learning rate over time using only first-order information and has minimal computational overhead beyond vanilla stochastic gradient descent (SGD). The authors of ADADELTA also argue that the method is robust to noisy gradient information, different model architecture choices, various data modalities and the selection of hyper-parameters. We observed that ADADELTA indeed converged faster than vanilla SGD optimization. In our experiments, we treated the word embeddings and character embeddings as parameters as well. We used Theano to implement our algorithm, and all the experiments were run on NVIDIA GPUs. To prevent over-fitting, we used an early-stopping strategy to determine the appropriate number of epochs during training. We did not make an effort to tune these hyper-parameters, and they remained the same in both our supervised and semi-supervised learning experiments.

Supervised Learning
In this setting, our neural CRF autoencoder model had access to the full amount of annotated training data in the UD dataset. As described in Section 5, the decoder's parameters Θ were estimated using real counts from the labeled data. We compared our model with existing sequence labeling models, including an HMM, a CRF, an LSTM and a neural CRF (NCRF), on all 8 languages. Among these models, the NCRF can be most directly compared to our model, as it serves as the base of our model, but without the decoder (and, as a result, can only be used for supervised learning).
The results, summarized in Table 2, show that our NCRF-AE consistently outperformed all other systems on all 8 languages, including Russian, Indonesian and Croatian, which had considerably less data than the other languages. Interestingly, the NCRF consistently came second to our model, which demonstrates the efficacy of the expressivity added to our model by the decoder, together with an appropriate optimization approach.
To better understand the performance differences between the models, we performed an error analysis using an illustrative example, described in Figure 3 (the sentence "Google is a nice search engine ."). In this example, the LSTM incorrectly predicted the POS tag of the word "search" as a verb instead of a noun (part of the NP "nice search engine"), while correctly predicting the preceding word, "nice", as an adjective. We attribute the error to the LSTM lacking an explicit output transition scoring function, which would penalize the ungrammatical transition between "ADJ" and "VERB". The NCRF, which does score such transitions, correctly predicted that word. However, it incorrectly predicted "Google" as a noun rather than a proper noun. This is a subtle mistake, as the two are grammatically and semantically similar. This mistake appeared consistently in the NCRF results, while the NCRF-AE predictions were correct.

Table 3: Semi-supervised learning accuracy of POS tagging on 8 UD languages. HEM means hard-EM, used as a self-training approach, and OL means only 20% of the labeled data is used and no unlabeled data is used.
We attribute this success to the superior expressivity of our model: the prediction is made jointly by the encoder and the decoder, as the reconstruction decision is defined over all output sequences, picking the jointly optimal sequence. From another perspective, our NCRF-AE model is a combination of discriminative and generative models; in that sense, the decoder can be regarded as a soft constraint that supplements the encoder, acting as a regularizer that checks and balances the choices made by the encoder.

Semi-supervised Learning
In the semi-supervised settings we compared our models with other semi-supervised structured prediction models. In addition, we studied how varying the amount of unlabeled data would change the performance of our model.
As described in Sec. 5, the decoder's parameters Θ are initialized by the labeled dataset using real counts and updated in training.

Varying Unlabeled Data Proportion
We first experimented with varying the proportion of unlabeled data while fixing the amount of labeled data. We conducted these experiments over two languages: English and the low-resource language Croatian. We fixed the proportion of labeled data at 20%, and gradually added more unlabeled data, from 0% to 20% (from full supervision to semi-supervision). The unlabeled data was sampled from the same dataset (without overlapping with the labeled data), with the labels removed. The results are shown in Figure 4.
The leftmost point of both sub-figures is the accuracy of fully supervised learning with 20% of the whole data. As we can observe, the tagging accuracy increased as the proportion of unlabeled data increased.

Figure 4: UD English and Croatian POS tagging accuracy versus increasing proportion of unlabeled sequences using 20% labeled data. The green straight line is the performance of the neural CRF, trained over the labeled data.

Semi-supervised POS Tagging on Multiple Languages
We compared our NCRF-AE model with other semi-supervised learning models, including the HMM-EM algorithm and the hard-EM version of our NCRF-AE. The hard-EM version of our model can be considered a variant of self-training, as it infers the missing labels using the current model in the E-step, and uses the real counts of these labels to update the model in the M-step. To contextualize the results, we also provide the results of the NCRF model and the supervised version of our NCRF-AE model trained on 20% of the data. We set the proportion of labeled data to 20% for each language and the proportion of unlabeled data to 50% of the dataset. There was no overlap between the labeled and unlabeled data. The results are summarized in Table 3. Similar to the supervised experiments, the supervised version of our NCRF-AE, trained over 20% of the labeled data, outperforms the NCRF model. Our model was able to successfully use the unlabeled data, leading to improved performance in all languages over both the supervised version of our model and the HMM-EM and hard-EM models, which were also trained over both the labeled and unlabeled data.

Varying Sizes of Labeled Data on English
It is well known that semi-supervised approaches tend to work well when given a small amount of labeled training data, but that as the amount of labeled training data increases, their effectiveness may diminish. To verify this conjecture, we conducted additional experiments to show how varying sizes of labeled training data affect the effectiveness of our NCRF-AE model. In these experiments, we gradually increased the proportion of labeled data and correspondingly decreased the proportion of unlabeled data. The results of these experiments are shown in Figure 5. As we speculated, we observed diminishing effectiveness when increasing the proportion of labeled data in training.

Conclusion
We proposed an end-to-end neural CRF autoencoder (NCRF-AE) model for semi-supervised sequence labeling. Our NCRF-AE is an integration of a discriminative model and a generative model, extending the generalized autoencoder by using a neural CRF model as its encoder and building a generative decoder on top of it. We proposed a variant of the EM algorithm to learn the parameters of our NCRF-AE model. We evaluated our model in both supervised and semi-supervised scenarios over multiple languages, and showed that it can outperform other supervised and semi-supervised methods. Additional experiments showed how varying sizes of labeled training data affect the effectiveness of our model.
These results demonstrate the strength of our model: it was able to utilize a small amount of labeled data and exploit the hidden information in a large amount of unlabeled data, without the additional feature engineering that is often needed to make semi-supervised and weakly-supervised systems perform well. The superior performance on low-resource languages also suggests its potential in practical use.