Globally Normalized Transition-Based Neural Networks

We introduce a globally normalized transition-based neural network model that achieves state-of-the-art part-of-speech tagging, dependency parsing and sentence compression results. Our model is a simple feed-forward neural network that operates on a task-specific transition system, yet achieves comparable or better accuracies than recurrent models. We discuss the importance of global as opposed to local normalization: a key insight is that the label bias problem implies that globally normalized models can be strictly more expressive than locally normalized models.


Introduction
Neural network approaches have taken the field of natural language processing (NLP) by storm. In particular, variants of long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) have produced impressive results on some of the classic NLP tasks such as part-of-speech tagging, syntactic parsing (Vinyals et al., 2015) and semantic role labeling (Zhou and Xu, 2015). One might speculate that it is the recurrent nature of these models that enables these results.
In this work we demonstrate that simple feed-forward networks without any recurrence can achieve comparable or better accuracies than LSTMs, as long as they are globally normalized. Our model, described in detail in Section 2, uses a transition system (Nivre, 2006) and feature embeddings as introduced by Chen and Manning (2014). We do not use any recurrence, but perform beam search for maintaining multiple hypotheses and introduce global normalization with a conditional random field (CRF) objective (Bottou et al., 1997; Le Cun et al., 1998; Lafferty et al., 2001; Collobert et al., 2011) to overcome the label bias problem that locally normalized models suffer from. Since we use beam inference, we approximate the partition function by summing over the elements in the beam, and use early updates (Collins and Roark, 2004). We compute gradients based on this approximate global normalization and perform full backpropagation training of all neural network parameters based on the CRF loss.
In Section 3 we revisit the label bias problem and the implication that globally normalized models are strictly more expressive than locally normalized models. Lookahead features can partially mitigate this discrepancy, but cannot fully compensate for it, a point to which we return later. To empirically demonstrate the effectiveness of global normalization, we evaluate our model on part-of-speech tagging, syntactic dependency parsing and sentence compression (Section 4). Our model achieves state-of-the-art accuracy on all of these tasks, matching or outperforming LSTMs while being significantly faster. In particular, for dependency parsing on the Wall Street Journal we achieve the best-ever published unlabeled attachment score of 94.61%.
* On leave from Columbia University.
As discussed in more detail in Section 5, we also outperform previous structured training approaches used for neural network transition-based parsing. Our ablation experiments show that we outperform earlier approaches that fix the neural network parameters when training the global part of their model, because we do global backpropagation training of all model parameters. We also outperform previous work that performs full backpropagation training, despite using a smaller beam. To shed additional light on the label bias problem in practice, we provide a sentence compression example where the local model completely fails. We then demonstrate that a globally normalized parsing model without any lookahead features is almost as accurate as our best model, while a locally normalized model loses more than 10% absolute in accuracy because it cannot effectively incorporate evidence as it becomes available.
Finally, we provide an open-source implementation of our method, called SyntaxNet, which we have integrated into the popular TensorFlow framework. We also provide a pre-trained, state-of-the-art English dependency parser called "Parsey McParseface," which we tuned for a balance of speed, simplicity, and accuracy.

Model
At its core, our model is an incremental transition-based parser (Nivre, 2006). To apply it to different tasks we only need to adjust the transition system and the input features.

Transition System
Given an input $x$, most often a sentence, we define:
• A set of states $S(x)$.
• A special start state $s^\dagger \in S(x)$.
• A set of allowed decisions $A(s, x)$ for all $s \in S(x)$.
• A transition function $t(s, d, x)$ returning a new state $s'$ for any decision $d \in A(s, x)$.
We will use a function $\rho(s, d, x; \theta)$ to compute the score of decision $d$ in state $s$ for input $x$. The vector $\theta$ contains the model parameters and we assume that $\rho(s, d, x; \theta)$ is differentiable with respect to $\theta$. In this section, for brevity, we will drop the dependence on $x$ in the functions given above, simply writing $S$, $A(s)$, $t(s, d)$, and $\rho(s, d; \theta)$.
Throughout this work we will use transition systems in which all complete structures for the same input $x$ have the same number of decisions $n(x)$ (or $n$ for brevity). In dependency parsing for example, this is true for both the arc-standard and arc-eager transition systems (Nivre, 2006), where for a sentence $x$ of length $m$, the number of decisions for any complete parse is $n(x) = 2 \times m$. A complete structure is then a sequence of decision/state pairs $(s_1, d_1) \ldots (s_n, d_n)$ such that $s_1 = s^\dagger$, $d_i \in A(s_i)$ for $i = 1 \ldots n$, and $s_{i+1} = t(s_i, d_i)$. We use the notation $d_{1:j}$ to refer to a decision sequence $d_1 \ldots d_j$.
We assume that there is a one-to-one mapping between decision sequences $d_{1:j-1}$ and states $s_j$: that is, we essentially assume that a state encodes the entire history of decisions. Thus, each state can be reached by a unique decision sequence from $s^\dagger$.⁴ We will use decision sequences $d_{1:j-1}$ and states interchangeably: in a slight abuse of notation, we define $\rho(d_{1:j-1}, d; \theta)$ to be equal to $\rho(s, d; \theta)$ where $s$ is the state reached by the decision sequence $d_{1:j-1}$.
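The definitions above can be made concrete with a small sketch of an arc-standard style transition system. This is only an illustration under stated assumptions, not the SyntaxNet implementation: the state layout (stack, buffer, decision history), the function names, and the choice to place an artificial ROOT token on the stack are all ours, and the arcs themselves are not recorded because only the number of decisions matters here.

```python
import random
from typing import List, Tuple

# A state is (stack, buffer, history). Storing the history makes the mapping
# between decision sequences and states one-to-one, as assumed in the text.
State = Tuple[Tuple[int, ...], Tuple[int, ...], Tuple[str, ...]]

SHIFT, LEFT_ARC, RIGHT_ARC = "SHIFT", "LEFT-ARC", "RIGHT-ARC"


def start_state(m: int) -> State:
    """Start state: ROOT (index 0) on the stack, tokens 1..m on the buffer."""
    return ((0,), tuple(range(1, m + 1)), ())


def allowed(s: State) -> List[str]:
    """The decision set A(s)."""
    stack, buffer, _ = s
    decisions = []
    if buffer:
        decisions.append(SHIFT)
    if len(stack) >= 2:
        decisions.append(RIGHT_ARC)      # second-from-top becomes head of top
        if stack[-2] != 0:
            decisions.append(LEFT_ARC)   # ROOT may never receive a head
    return decisions


def transition(s: State, d: str) -> State:
    """The transition function t(s, d)."""
    if d not in allowed(s):
        raise ValueError(f"decision {d!r} is not allowed in this state")
    stack, buffer, history = s
    if d == SHIFT:
        stack, buffer = stack + (buffer[0],), buffer[1:]
    elif d == LEFT_ARC:
        stack = stack[:-2] + (stack[-1],)  # remove the second-from-top token
    else:  # RIGHT_ARC
        stack = stack[:-1]                 # remove the top token
    return (stack, buffer, history + (d,))


def is_final(s: State) -> bool:
    stack, buffer, _ = s
    return not buffer and stack == (0,)


def random_derivation(m: int, rng: random.Random) -> List[str]:
    """Follow uniformly random allowed decisions until a final state."""
    s = start_state(m)
    while not is_final(s):
        s = transition(s, rng.choice(allowed(s)))
    return list(s[2])
```

Every token is shifted exactly once and removed from the stack exactly once, so any complete derivation for a sentence of length m contains 2 × m decisions, whichever decisions are chosen along the way.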
The scoring function $\rho(s, d; \theta)$ can be defined in a number of ways. In this work, following Chen and Manning (2014) and subsequent work, we define it via a feed-forward neural network as
$$\rho(s, d; \theta) = \phi(s; \theta^{(l)}) \cdot \theta^{(d)}.$$
Here $\theta^{(l)}$ are the parameters of the neural network, excluding the parameters at the final layer. $\theta^{(d)}$ are the final layer parameters for decision $d$. $\phi(s; \theta^{(l)})$ is the representation for state $s$ computed by the neural network under parameters $\theta^{(l)}$. Note that the score is linear in the parameters $\theta^{(d)}$. We next describe how softmax-style normalization can be performed at the local or global level.
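A minimal sketch of such a scoring function follows, assuming a single ReLU hidden layer over concatenated feature embeddings. The class name, layer sizes, initialization, and the use of integer feature ids are illustrative assumptions rather than the configuration used in the experiments.

```python
import random
from typing import Dict, List, Sequence


class FeedForwardScorer:
    """rho(s, d) = phi(s; theta_l) . theta_d, with one hidden layer (assumed)."""

    def __init__(self, vocab_size: int, embed_dim: int, num_features: int,
                 hidden_dim: int, decisions: Sequence[str], seed: int = 0):
        rng = random.Random(seed)

        def init_vector(size: int) -> List[float]:
            return [rng.uniform(-0.1, 0.1) for _ in range(size)]

        # theta^(l): embeddings plus hidden-layer weights and biases.
        self.embeddings = [init_vector(embed_dim) for _ in range(vocab_size)]
        self.weights = [init_vector(embed_dim * num_features)
                        for _ in range(hidden_dim)]
        self.bias = [0.1] * hidden_dim
        # theta^(d): one final-layer weight vector per decision.
        self.final: Dict[str, List[float]] = {
            d: init_vector(hidden_dim) for d in decisions
        }

    def phi(self, feature_ids: Sequence[int]) -> List[float]:
        """State representation: embed, concatenate, affine map, ReLU."""
        x = [v for f in feature_ids for v in self.embeddings[f]]
        return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.weights, self.bias)]

    def rho(self, feature_ids: Sequence[int], decision: str) -> float:
        """Score of a decision: linear in the final-layer parameters."""
        hidden = self.phi(feature_ids)
        return sum(h * t for h, t in zip(hidden, self.final[decision]))
```

Because the state representation is shared by all decisions, scoring every decision in a state needs only one pass through the hidden layer followed by one dot product per decision.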

Global vs. Local Normalization
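In the Chen and Manning (2014) style of greedy neural network parsing, the conditional probability distribution over decisions $d_j$ given context $d_{1:j-1}$ is defined as
$$p(d_j \mid d_{1:j-1}; \theta) = \frac{\exp \rho(d_{1:j-1}, d_j; \theta)}{Z_L(d_{1:j-1}; \theta)}, \tag{1}$$
where
$$Z_L(d_{1:j-1}; \theta) = \sum_{d' \in A(d_{1:j-1})} \exp \rho(d_{1:j-1}, d'; \theta).$$
Each $Z_L(d_{1:j-1}; \theta)$ is a local normalization term. The probability of a sequence of decisions $d_{1:n}$ is
$$p_L(d_{1:n}) = \prod_{j=1}^{n} p(d_j \mid d_{1:j-1}; \theta) = \frac{\exp \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta)}{\prod_{j=1}^{n} Z_L(d_{1:j-1}; \theta)}. \tag{2}$$
Beam search can be used to attempt to find the maximum of Eq. (2) with respect to $d_{1:n}$. The additive scores used in beam search are the log-softmax of each decision, $\ln p(d_j \mid d_{1:j-1}; \theta)$, not the raw scores $\rho(d_{1:j-1}, d_j; \theta)$.
⁴ It is straightforward to extend the approach to make use of dynamic programming in the case where the same state can be reached by multiple decision sequences.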
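In contrast, a Conditional Random Field (CRF) defines a distribution $p_G(d_{1:n})$ as follows:
$$p_G(d_{1:n}) = \frac{\exp \sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta)}{Z_G(\theta)}, \tag{3}$$
where
$$Z_G(\theta) = \sum_{d'_{1:n} \in D_n} \exp \sum_{j=1}^{n} \rho(d'_{1:j-1}, d'_j; \theta)$$
and $D_n$ is the set of all valid sequences of decisions of length $n$. Since $Z_G(\theta)$ does not depend on $d_{1:n}$, the most probable sequence is the one that maximizes the summed scores $\sum_{j=1}^{n} \rho(d_{1:j-1}, d_j; \theta)$. Beam search can again be used to approximately find the argmax.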
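The difference between the two normalizations can be checked by brute force on a tiny problem. The sketch below assumes that the same decision set is allowed in every state and enumerates all sequences to obtain the global partition function; `toy_rho` is a hypothetical score function invented for the illustration.

```python
import itertools
import math
from typing import Callable, Sequence, Tuple

Score = Callable[[Tuple[str, ...], str], float]


def sequence_score(rho: Score, seq: Tuple[str, ...]) -> float:
    """Sum of decision scores rho(d_{1:j-1}, d_j) along one sequence."""
    return sum(rho(seq[:j], seq[j]) for j in range(len(seq)))


def p_local(rho: Score, seq: Tuple[str, ...], decisions: Sequence[str]) -> float:
    """Locally normalized probability: a softmax at every step."""
    log_p = 0.0
    for j in range(len(seq)):
        prefix = seq[:j]
        log_z = math.log(sum(math.exp(rho(prefix, d)) for d in decisions))
        log_p += rho(prefix, seq[j]) - log_z
    return math.exp(log_p)


def p_global(rho: Score, seq: Tuple[str, ...], decisions: Sequence[str]) -> float:
    """Globally normalized probability: one softmax over whole sequences."""
    z = sum(math.exp(sequence_score(rho, s))
            for s in itertools.product(decisions, repeat=len(seq)))
    return math.exp(sequence_score(rho, seq)) / z


def toy_rho(prefix: Tuple[str, ...], d: str) -> float:
    """Hypothetical scores: repeating is mildly good, 'b' after 'a a' is very good."""
    score = 1.0 if prefix and prefix[-1] == d else 0.0
    if prefix == ("a", "a") and d == "b":
        score += 3.0
    return score
```

Both functions define proper distributions over sequences, but they generally assign different probabilities to the same sequence even though they share the same scores, because the local model renormalizes at every step.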

Training
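Training data consists of inputs $x$ paired with gold decision sequences $d^*_{1:n}$. We use stochastic gradient descent on the negative log-likelihood of the data under the model. Under a locally normalized model, the negative log-likelihood is
$$L_{\text{local}}(d^*_{1:n}; \theta) = -\ln p_L(d^*_{1:n}; \theta) = -\sum_{j=1}^{n} \rho(d^*_{1:j-1}, d^*_j; \theta) + \sum_{j=1}^{n} \ln Z_L(d^*_{1:j-1}; \theta), \tag{4}$$
whereas under a globally normalized model it is
$$L_{\text{global}}(d^*_{1:n}; \theta) = -\ln p_G(d^*_{1:n}; \theta) = -\sum_{j=1}^{n} \rho(d^*_{1:j-1}, d^*_j; \theta) + \ln Z_G(\theta). \tag{5}$$
A significant practical advantage of the locally normalized cost Eq. (4) is that the local partition function $Z_L$ and its derivative can usually be computed efficiently. In contrast, the $Z_G$ term in Eq. (5) contains a sum over $d'_{1:n} \in D_n$ that is in many cases intractable.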
To make learning tractable with the globally normalized model, we use beam search and early updates (Collins and Roark, 2004). As the training sequence is being decoded, we keep track of the location of the gold path in the beam. If the gold path falls out of the beam at step $j$, a stochastic gradient step is taken on the following objective:
$$L_{\text{global-beam}}(d^*_{1:j}; \theta) = -\sum_{i=1}^{j} \rho(d^*_{1:i-1}, d^*_i; \theta) + \ln \sum_{d'_{1:j} \in B_j} \exp \sum_{i=1}^{j} \rho(d'_{1:i-1}, d'_i; \theta). \tag{6}$$
Here the set $B_j$ contains all paths in the beam at step $j$, together with the gold path prefix $d^*_{1:j}$. It is straightforward to derive gradients of the loss in Eq. (6) and to back-propagate gradients to all levels of a neural network defining the score $\rho(s, d; \theta)$. If the gold path remains in the beam throughout decoding, a gradient step is performed using $B_n$, the beam at the end of decoding.
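The early-update procedure can be sketched as follows. This is a simplified illustration, not the training code: it only computes the value of the beam loss (gold score subtracted from the log-sum-exp over the beam plus the gold prefix), assumes the same decision set in every state, and leaves gradient computation to whatever automatic differentiation tool is used. The function names and `toy_rho` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Score = Callable[[Tuple[str, ...], str], float]


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable ln(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def early_update_loss(rho: Score, decisions: Sequence[str],
                      gold: Sequence[str], beam_size: int) -> Tuple[float, int]:
    """Return (loss, j): the beam loss and the step at which it is computed.

    j is the first step where the gold prefix falls out of the beam,
    or len(gold) if the gold path survives until the end of decoding.
    """
    gold = tuple(gold)
    beam: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]
    gold_score = 0.0
    for j in range(1, len(gold) + 1):
        # Expand every path in the beam by every decision, keep the best ones.
        candidates = [(prefix + (d,), score + rho(prefix, d))
                      for prefix, score in beam for d in decisions]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
        gold_score += rho(gold[:j - 1], gold[j - 1])
        gold_in_beam = any(prefix == gold[:j] for prefix, _ in beam)
        if not gold_in_beam or j == len(gold):
            scores = [score for _, score in beam]
            if not gold_in_beam:
                scores.append(gold_score)  # B_j always contains the gold prefix
            return -gold_score + log_sum_exp(scores), j
    raise ValueError("gold sequence must contain at least one decision")


def toy_rho(prefix: Tuple[str, ...], d: str) -> float:
    """Hypothetical scores used only for the example."""
    score = 1.0 if prefix and prefix[-1] == d else 0.0
    if prefix == ("a", "a") and d == "b":
        score += 3.0
    return score
```

Since the gold prefix is always one of the summed paths, the loss is non-negative, and when the beam is large enough to hold every sequence it coincides with the exact global negative log-likelihood.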

The Label Bias Problem
Intuitively, we would like the model to be able to revise an earlier decision made during search, when later evidence becomes available that rules out the earlier decision as incorrect. At first glance, it might appear that a locally normalized model used in conjunction with beam search or exact search is able to revise earlier decisions. However the label bias problem (see Bottou (1991), Collins (1999) pages 222-226, Lafferty et al. (2001), Bottou and LeCun (2005), Smith and Johnson (2007)) means that locally normalized models often have a very weak ability to revise earlier decisions.
This section gives a formal perspective on the label bias problem, through a proof that globally normalized models are strictly more expressive than locally normalized models. The theorem was originally proved by Smith and Johnson (2007). The example underlying the proof gives a clear illustration of the label bias problem.⁶
Global Models can be Strictly More Expressive than Local Models
Consider a tagging problem where the task is to map an input sequence $x_{1:n}$ to a decision sequence $d_{1:n}$. First, consider a locally normalized model where we restrict the scoring function to access only the first $i$ input symbols $x_{1:i}$ when scoring decision $d_i$. We will return to this restriction soon. The scoring function $\rho$ can be an otherwise arbitrary function of the tuple $\langle d_{1:i-1}, d_i, x_{1:i} \rangle$:
$$p_L(d_{1:n} \mid x_{1:n}) = \prod_{i=1}^{n} p_L(d_i \mid d_{1:i-1}, x_{1:i}) = \frac{\exp \sum_{i=1}^{n} \rho(d_{1:i-1}, d_i, x_{1:i})}{\prod_{i=1}^{n} Z_L(d_{1:i-1}, x_{1:i})}.$$
Second, consider a globally normalized model
$$p_G(d_{1:n} \mid x_{1:n}) = \frac{\exp \sum_{i=1}^{n} \rho(d_{1:i-1}, d_i, x_{1:i})}{Z_G(x_{1:n})}.$$
This model again makes use of a scoring function $\rho(d_{1:i-1}, d_i, x_{1:i})$ restricted to the first $i$ input symbols when scoring decision $d_i$.
Define $P_L$ to be the set of all possible distributions $p_L(d_{1:n} \mid x_{1:n})$ under the local model obtained as the scores $\rho$ vary. Similarly, define $P_G$ to be the set of all possible distributions $p_G(d_{1:n} \mid x_{1:n})$ under the global model. Here a "distribution" is a function from a pair $(x_{1:n}, d_{1:n})$ to a probability $p(d_{1:n} \mid x_{1:n})$. Our main result is the following:
Theorem 3.1 (see also Smith and Johnson (2007)). $P_L$ is a strict subset of $P_G$, that is $P_L \subsetneq P_G$.
To prove this we will first prove that $P_L \subseteq P_G$. This step is straightforward. We then show that $P_G \nsubseteq P_L$; that is, there are distributions in $P_G$ that are not in $P_L$. The proof that $P_G \nsubseteq P_L$ gives a clear illustration of the label bias problem.
Proof that $P_L \subseteq P_G$: We need to show that for any locally normalized distribution $p_L$, we can construct a globally normalized model $p_G$ such that $p_G = p_L$. Consider a locally normalized model with scores $\rho(d_{1:i-1}, d_i, x_{1:i})$. Define a global model $p_G$ with scores
$$\rho'(d_{1:i-1}, d_i, x_{1:i}) = \ln p_L(d_i \mid d_{1:i-1}, x_{1:i}).$$
Then it is easily verified that $p_G(d_{1:n} \mid x_{1:n}) = p_L(d_{1:n} \mid x_{1:n})$ for all $x_{1:n}$, $d_{1:n}$.
⁶ Smith and Johnson (2007) cite Michael Collins as the source of the example underlying the proof. Note that the theorem refers to conditional models of the form $p(d_{1:n} \mid x_{1:n})$ with global or local normalization. Equivalence (or non-equivalence) results for joint models of the form $p(d_{1:n}, x_{1:n})$ are quite different: for example results from Chi (1999) and Abney et al. (1999) imply that weighted context-free grammars (a globally normalized joint model) and probabilistic context-free grammars (a locally normalized joint model) are equally expressive.
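The verification can be written out in a few lines. The notation $\rho'$ for the scores of the constructed global model is ours; the construction takes each global score to be the log of the corresponding local conditional probability, $\rho'(d_{1:i-1}, d_i, x_{1:i}) = \ln p_L(d_i \mid d_{1:i-1}, x_{1:i})$, and $D_n$ denotes the set of all decision sequences of length $n$.

```latex
\begin{align*}
\exp \sum_{i=1}^{n} \rho'(d_{1:i-1}, d_i, x_{1:i})
  &= \prod_{i=1}^{n} p_L(d_i \mid d_{1:i-1}, x_{1:i})
   = p_L(d_{1:n} \mid x_{1:n}), \\
Z_G(x_{1:n})
  &= \sum_{d'_{1:n} \in D_n} p_L(d'_{1:n} \mid x_{1:n}) = 1, \\
p_G(d_{1:n} \mid x_{1:n})
  &= \frac{p_L(d_{1:n} \mid x_{1:n})}{Z_G(x_{1:n})}
   = p_L(d_{1:n} \mid x_{1:n}).
\end{align*}
```

The second line holds because $p_L$ is itself a probability distribution over decision sequences, so the global partition function of the constructed model is exactly one.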
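In proving $P_G \nsubseteq P_L$ we will use a simple problem where every example seen in training or test data is one of the following two tagged sentences:
$$x_1 x_2 x_3 = a\ b\ c, \quad d_1 d_2 d_3 = A\ B\ C$$
$$x_1 x_2 x_3 = a\ b\ e, \quad d_1 d_2 d_3 = A\ D\ E \tag{7}$$
Note that the input $x_2 = b$ is ambiguous: it can take tags $B$ or $D$. This ambiguity is resolved when the next input symbol, $c$ or $e$, is observed. Now consider a globally normalized model, where the scores $\rho(d_{1:i-1}, d_i, x_{1:i})$ are defined as follows.
Define $T$ as the set $\{(A, B), (B, C), (A, D), (D, E)\}$ of bigram tag transitions seen in the data. Similarly, define $E$ as the set $\{(a, A), (b, B), (c, C), (b, D), (e, E)\}$ of (word, tag) pairs seen in the data. We define
$$\rho(d_{1:i-1}, d_i, x_{1:i}) = \alpha \times [\![(d_{i-1}, d_i) \in T]\!] + \alpha \times [\![(x_i, d_i) \in E]\!], \tag{8}$$
where $\alpha$ is the single scalar parameter of the model, and $[\![\pi]\!] = 1$ if $\pi$ is true, 0 otherwise.
Proof that $P_G \nsubseteq P_L$: We will construct a globally normalized model $p_G$ such that there is no locally normalized model such that $p_L = p_G$.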
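Under the definition in Eq. (8), it is straightforward to show that
$$\lim_{\alpha \to \infty} p_G(A\,B\,C \mid a\,b\,c) = \lim_{\alpha \to \infty} p_G(A\,D\,E \mid a\,b\,e) = 1.$$
In contrast, under any definition for $\rho(d_{1:i-1}, d_i, x_{1:i})$, we must have
$$p_L(A\,B\,C \mid a\,b\,c) + p_L(A\,D\,E \mid a\,b\,e) \le 1. \tag{9}$$
This follows because
$$p_L(A\,B\,C \mid a\,b\,c) = p_L(A \mid a) \times p_L(B \mid A, a\,b) \times p_L(C \mid A\,B, a\,b\,c),$$
$$p_L(A\,D\,E \mid a\,b\,e) = p_L(A \mid a) \times p_L(D \mid A, a\,b) \times p_L(E \mid A\,D, a\,b\,e).$$
The inequality $p_L(B \mid A, a\,b) + p_L(D \mid A, a\,b) \le 1$ then immediately implies Eq. (9). It follows that for sufficiently large values of $\alpha$, we have $p_G(A\,B\,C \mid a\,b\,c) + p_G(A\,D\,E \mid a\,b\,e) > 1$, and given Eq. (9) it is impossible to define a locally normalized model with $p_L(A\,B\,C \mid a\,b\,c) = p_G(A\,B\,C \mid a\,b\,c)$ and $p_L(A\,D\,E \mid a\,b\,e) = p_G(A\,D\,E \mid a\,b\,e)$.
Under the restriction that scores $\rho(d_{1:i-1}, d_i, x_{1:i})$ depend only on the first $i$ input symbols, the globally normalized model is still able to model the data in Eq. (7), while the locally normalized model fails (see Eq. 9). The ambiguity at input symbol $b$ is naturally resolved when the next symbol ($c$ or $e$) is observed, but the locally normalized model is not able to revise its prediction.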
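The argument can be checked numerically. The sketch below is our reading of the example, with loudly stated assumptions: the two tagged sentences are a/A b/B c/C and a/A b/D e/E, each decision scores alpha for a (word, tag) pair seen in the data plus alpha for a tag bigram seen in the data, the first position has no tag bigram term, and every tag is allowed at every position. The set names and function names are hypothetical.

```python
import itertools
import math
from typing import Tuple

TAGS = ("A", "B", "C", "D", "E")
# Assumed contents, read off the two tagged sentences of the example.
TAG_BIGRAMS = {("A", "B"), ("B", "C"), ("A", "D"), ("D", "E")}
WORD_TAG_PAIRS = {("a", "A"), ("b", "B"), ("c", "C"), ("b", "D"), ("e", "E")}


def rho(prefix: Tuple[str, ...], d: str, x: str, alpha: float) -> float:
    """Score of tag d at position len(prefix); looks only at x up to that position."""
    i = len(prefix)
    score = alpha if (x[i], d) in WORD_TAG_PAIRS else 0.0
    if prefix and (prefix[-1], d) in TAG_BIGRAMS:
        score += alpha
    return score


def total_score(tags: Tuple[str, ...], x: str, alpha: float) -> float:
    return sum(rho(tags[:i], tags[i], x, alpha) for i in range(len(x)))


def p_global(tags: Tuple[str, ...], x: str, alpha: float) -> float:
    """Globally normalized probability, by enumerating all tag sequences."""
    z = sum(math.exp(total_score(s, x, alpha))
            for s in itertools.product(TAGS, repeat=len(x)))
    return math.exp(total_score(tags, x, alpha)) / z


def p_local(tags: Tuple[str, ...], x: str, alpha: float) -> float:
    """Locally normalized probability using the same scores."""
    p = 1.0
    for i in range(len(x)):
        z = sum(math.exp(rho(tags[:i], d, x, alpha)) for d in TAGS)
        p *= math.exp(rho(tags[:i], tags[i], x, alpha)) / z
    return p


def mass_on_both_examples(model, alpha: float) -> float:
    """p(A B C | a b c) + p(A D E | a b e) under the given model."""
    return (model(("A", "B", "C"), "abc", alpha)
            + model(("A", "D", "E"), "abe", alpha))
```

For large alpha the global model puts nearly all probability on the correct tagging of each sentence, so `mass_on_both_examples(p_global, alpha)` approaches 2, while `mass_on_both_examples(p_local, alpha)` can never exceed 1: the local model has to split its probability between tags B and D before it has seen the disambiguating third symbol.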
It is easy to fix the locally normalized model for the example in Eq. (7) by allowing scores $\rho(d_{1:i-1}, d_i, x_{1:i+1})$ that take into account the input symbol $x_{i+1}$. More generally we can have a model of the form $\rho(d_{1:i-1}, d_i, x_{1:i+k})$ where the integer $k$ specifies the amount of lookahead in the model. Such lookahead is common in practice, but insufficient in general. For every amount of lookahead $k$, we can construct examples that cannot be modeled with a locally normalized model by duplicating the middle input $b$ in (7) $k + 1$ times. Only a local model with scores $\rho(d_{1:i-1}, d_i, x_{1:n})$ that considers the entire input can capture any distribution $p(d_{1:n} \mid x_{1:n})$: in this case the decomposition $p_L(d_{1:n} \mid x_{1:n}) = \prod_{i=1}^{n} p_L(d_i \mid d_{1:i-1}, x_{1:n})$ makes no independence assumptions.
However, increasing the amount of context used as input comes at a cost, requiring more powerful learning algorithms, and potentially more training data. For a detailed analysis of the tradeoffs between structural features in CRFs and more powerful local classifiers without structural constraints, see Liang et al. (2008); in these experiments local classifiers are unable to reach the performance of CRFs on problems such as parsing and named entity recognition where structural constraints are important. Note that there is nothing to preclude an approach that makes use of both global normalization and more powerful scoring functions $\rho(d_{1:i-1}, d_i, x_{1:n})$, obtaining the best of both worlds. The experiments that follow make use of both.

Experiments
To demonstrate the flexibility and modeling power of our approach, we provide experimental results on a diverse set of structured prediction tasks. We apply our approach to POS tagging, syntactic dependency parsing, and sentence compression.
While directly optimizing the global model defined by Eq. (5) works well, we found that training the model in two steps achieves the same precision much faster: we first pretrain the network using the local objective given in Eq. (4), and then perform additional training steps using the global objective given in Eq. (6). We pretrain all layers except the softmax layer in this way. We purposefully abstain from complicated hand engineering of input features, which might improve performance further (Durrett and Klein, 2015).
We use the same training recipe, taken from prior work, for each training stage of our model. Specifically, we use averaged stochastic gradient descent with momentum, and we tune the learning rate, learning rate schedule, momentum, and early stopping time using a separate held-out corpus for each task. We tune again with a different set of hyperparameters for training with the global objective.

Part of Speech Tagging
Part of speech (POS) tagging is a classic NLP task, where modeling the structure of the output is important for achieving state-of-the-art performance.

Data & Evaluation. We conducted experiments on a number of different datasets: (1) the English Wall Street Journal (WSJ) part of the Penn Treebank (Marcus et al., 1993) with standard POS tagging splits; (2) the English "Treebank Union" multi-domain corpus containing data from the OntoNotes corpus version 5 (Hovy et al., 2006), the English Web Treebank (Petrov and McDonald, 2012), and the updated and corrected Question Treebank (Judge et al., 2006), with the same setup as in prior work; and (3) the CoNLL '09 multi-lingual shared task (Hajič et al., 2009).
Model Configuration. Inspired by the integrated POS tagging and parsing transition system of Bohnet and Nivre (2012), we employ a simple transition system that uses only a SHIFT action and predicts the POS tag of the current word on the buffer as it gets shifted to the stack. We extract the following features on a window ±3 tokens centered at the current focus token: word, cluster, character n-gram up to length 3. We also extract the tag predicted for the previous 4 tokens. The network in these experiments has a single hidden layer with 256 units on WSJ and Treebank Union and 64 on CoNLL'09.
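The feature set described above can be sketched as a small extraction function. This is an illustrative guess at the bookkeeping, not the actual feature templates: the padding symbol, the dictionary layout, and the `clusters` lookup table (a mapping from word to cluster id) are assumptions made for the example.

```python
from typing import Dict, List, Sequence

PAD = "<pad>"  # assumed placeholder for positions outside the sentence


def char_ngrams(word: str, max_n: int = 3) -> List[str]:
    """All character n-grams of the word with length 1..max_n."""
    return [word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]


def tagging_features(words: Sequence[str], predicted_tags: Sequence[str],
                     focus: int, clusters: Dict[str, str],
                     window: int = 3, history: int = 4) -> Dict[str, list]:
    """Features for tagging words[focus].

    predicted_tags holds the tags already predicted for words[0:focus].
    """
    def token(i: int) -> str:
        return words[i] if 0 <= i < len(words) else PAD

    window_words = [token(i) for i in range(focus - window, focus + window + 1)]
    return {
        "words": window_words,
        "clusters": [clusters.get(w, PAD) for w in window_words],
        "char_ngrams": [char_ngrams(w) if w != PAD else [] for w in window_words],
        "previous_tags": [predicted_tags[i] if i >= 0 else PAD
                          for i in range(focus - history, focus)],
    }
```

In a full model each of these symbolic features would be mapped to an embedding and concatenated before the hidden layer; the tags of previous tokens are the only part of the input that depends on earlier decisions.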
Results. In Table 1 we compare our model to a linear CRF and to a compositional character-to-word LSTM model. The CRF is a first-order linear model with exact inference and the same emission features as our model. It additionally has transition features of the word, cluster and character n-gram up to length 3 on both endpoints of the transition. The results for the LSTM model were solicited from its authors.
Our local model already compares favorably against these methods on average. Using beam search with a locally normalized model does not help, but with global normalization it leads to a 7% reduction in relative error, empirically demonstrating the effect of label bias. The character n-gram features are very important, increasing average accuracy on the CoNLL'09 datasets by about 0.5% absolute. This shows that character-level modeling can also be done with a simple feed-forward network without recurrence.

Dependency Parsing
In dependency parsing the goal is to produce a directed tree representing the syntactic structure of the input sentence.
Data & Evaluation. We use the same corpora as in our POS tagging experiments, except that we use the standard parsing splits of the WSJ. To avoid over-fitting to the development set (Sec. 22), we use Sec. 24 for tuning the hyperparameters of our models. We convert the English constituency trees to Stanford style dependencies (De Marneffe et al., 2006) using version 3.3.0 of the converter. For English, we use predicted POS tags (the same POS tags are used for all models) and exclude punctuation from the evaluation, as is standard. For the CoNLL '09 datasets we follow standard practice and include all punctuation in the evaluation. Following prior work, we use our own predicted POS tags so that we can include a k-best tag feature (see below) but use the supplied predicted morphological features. We report unlabeled and labeled attachment scores (UAS/LAS).
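For readers unfamiliar with the two metrics, a minimal sketch of how UAS and LAS are typically computed is given below. The function name, the (head, label) token representation, and the punctuation mask are assumptions for the example and do not describe the evaluation scripts actually used.

```python
from typing import List, Optional, Sequence, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: Sequence[Arc], predicted: Sequence[Arc],
                      is_punct: Optional[Sequence[bool]] = None
                      ) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent over the scored tokens.

    Tokens flagged in is_punct are skipped, mirroring the convention of
    excluding punctuation for English; pass None to score every token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if is_punct is None:
        is_punct = [False] * len(gold)
    scored: List[int] = [i for i in range(len(gold)) if not is_punct[i]]
    if not scored:
        raise ValueError("no tokens left to score")
    correct_heads = sum(gold[i][0] == predicted[i][0] for i in scored)
    correct_arcs = sum(gold[i] == predicted[i] for i in scored)
    return (100.0 * correct_heads / len(scored),
            100.0 * correct_arcs / len(scored))
```

A labeled arc counts as correct only if its head is also correct, so LAS can never exceed UAS on the same set of tokens.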
Model Configuration. Our model configuration is basically the same as the one originally proposed by Chen and Manning (2014) and refined in subsequent work. In particular, we use the arc-standard transition system and extract the same set of features as prior work: words, part-of-speech tags, and dependency arcs and labels in the surrounding context of the state, as well as k-best tags. We use two hidden layers of 1,024 dimensions each.
Results. Tables 2 and 3 show our final parsing results and a comparison to the best systems from the literature. We obtain the best ever published results on almost all datasets, including the WSJ. Our main results use the same pre-trained word embeddings as prior work, but no tri-training. When we artificially restrict ourselves to not use pre-trained word embeddings, we observe only a modest drop of ∼0.5% UAS; for example, training only on the WSJ yields 94.08% UAS and 92.15% LAS for our global model with a beam of size 32.
Even though we do not use tri-training, our model compares favorably to the 94.26% UAS and 92.41% LAS previously reported with tri-training. As we show in Sec. 5, these gains can be attributed to the full backpropagation training that differentiates our approach from earlier structured training approaches. Our results also significantly outperform previous LSTM-based approaches.

Sentence Compression
Our final structured prediction task is extractive sentence compression.
Data & Evaluation. We follow Filippova et al. (2015), where a large news collection is used to heuristically generate compression instances. Our final corpus contains about 2.3M compression instances: we use 2M examples for training, 130k for development and 160k for the final test. We report per-token F1 score and per-sentence accuracy (A), i.e. percentage of instances that fully match the golden compressions. Following Filippova et al. (2015) we also run a human evaluation on 200 sentences where we ask the raters to score compressions for readability (read) and informativeness (info) on a scale from 0 to 5.
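The two automatic metrics can be sketched as follows. The exact definition of the per-token F1 is not spelled out above, so this sketch assumes it is the micro-averaged F1 of the keep decisions over all tokens; the function name and the boolean encoding (True = keep) are likewise assumptions.

```python
from typing import Sequence, Tuple


def compression_metrics(gold: Sequence[Sequence[bool]],
                        predicted: Sequence[Sequence[bool]]
                        ) -> Tuple[float, float]:
    """Return (per-token F1 of 'keep', per-sentence exact-match accuracy)."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need the same, non-zero number of sentences")
    true_pos = false_pos = false_neg = exact = 0
    for g, p in zip(gold, predicted):
        if len(g) != len(p):
            raise ValueError("sentence length mismatch")
        true_pos += sum(a and b for a, b in zip(g, p))
        false_pos += sum((not a) and b for a, b in zip(g, p))
        false_neg += sum(a and (not b) for a, b in zip(g, p))
        exact += list(g) == list(p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return f1, exact / len(gold)
```

Per-sentence accuracy is the stricter of the two: a single wrong keep or drop decision makes the whole sentence count as incorrect, while it lowers the per-token F1 only slightly.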
Model Configuration. The transition system for sentence compression is similar to POS tagging: we scan sentences from left-to-right and label each token as keep or drop. We extract features from words, POS tags, and dependency labels from a window of tokens centered on the input, as well as features from the history of predictions. We use a single hidden layer of size 400.
Results. Table 4 shows our sentence compression results. Our globally normalized model again significantly outperforms the local model. Beam search with a locally normalized model suffers from severe label bias issues that we discuss on a concrete example in Section 5. We also compare to the sentence compression system from Filippova et al. (2015), a 3-layer stacked LSTM which uses dependency label information. The LSTM and our global model perform on par on both the automatic evaluation and the human ratings, but our model is roughly 100× faster. All compressions kept approximately 42% of the tokens on average and all the models are significantly better than the automatic extractions (p < 0.05).

Discussion
We derived a proof for the label bias problem and the advantages of global models. We then empirically verified this theoretical superiority by demonstrating state-of-the-art performance on three different tasks. In this section we situate and compare our model to previous work and provide two examples of the label bias problem in practice.

Related Neural CRF Work
Neural network models have been combined with conditional random fields and globally normalized models before. Bottou et al. (1997) and Le Cun et al. (1998) describe global training of neural network models for structured prediction problems. Peng et al. (2009) add a non-linear neural network layer to a linear-chain CRF and Do and Artières (2010) apply a similar approach to more general Markov network structures. Yao et al. (2014) and Zheng et al. (2015) introduce recurrence into the model, and later work finally combines CRFs and LSTMs. These neural CRF models are limited to sequence labeling tasks where exact inference is possible, while our model works well when exact inference is intractable.

Related Transition-Based Parsing Work
For early work on neural networks for transition-based parsing, see Henderson (2003; 2004). Our work is closest to previous approaches, including Watanabe and Sumita (2015), in which global normalization is added to the local model of Chen and Manning (2014). Empirically, the best performance among these is achieved by a model that keeps the parameters of the locally normalized neural network fixed and only trains a perceptron that uses the activations as features. That model is therefore limited in its ability to revise the predictions of the locally normalized model. In Ta… propagation the CRF accuracy is 0.2% higher and training converged more than 4× faster. Other previous work performs full backpropagation training like us, but even with a much larger beam, its performance is significantly lower than ours. We also apply our model to two additional tasks, while that work experiments only with dependency parsing. Finally, Watanabe and Sumita (2015) introduce recurrent components and additional techniques like max-violation updates for a corresponding constituency parsing model. In contrast, our model does not require any recurrence or specialized training.

Label Bias in Practice
We observed several instances of severe label bias in the sentence compression task. Although using beam search with the local model outperforms greedy inference on average, beam search leads the local model to occasionally produce empty compressions (Table 6). It is important to note that these are not search errors: the empty compression has higher probability under p L than the prediction from greedy inference. However, the more expressive globally normalized model does not suffer from this limitation, and correctly gives the empty compression almost zero probability.
We also present some empirical evidence that the label bias problem is severe in parsing. We trained models where the scoring functions in parsing at position $i$ in the sentence are limited to considering only tokens $x_{1:i}$; hence unlike the full parsing model, there is no ability to look ahead in the sentence when making a decision. The result for a greedy model under this constraint is 76.96% UAS; for a locally normalized model with beam search it is 81.35%; and for a globally normalized model it is 93.60%. Thus the globally normalized model gets very close to the performance of a model with full lookahead, while the locally normalized model with a beam gives dramatically lower performance. In our final experiments with full lookahead, the globally normalized model achieves 94.01% accuracy, compared to 93.07% accuracy for a local model with beam search. Thus adding lookahead allows the local model to close the gap in performance to the global model; however there is still a significant difference in accuracy, which may in large part be due to the label bias problem.
A number of authors have considered modified training procedures for greedy models, or for locally normalized models. Daumé III et al. (2009) introduce Searn, an algorithm that allows a classifier making greedy decisions to become more robust to errors made in previous decisions. Goldberg and Nivre (2013) describe improvements to a greedy parsing approach that makes use of methods from imitation learning (Ross et al., 2011) to augment the training set. Note that these methods are focused on greedy models: they are unlikely to solve the label bias problem when used in conjunction with beam search, given that the problem is one of expressivity of the underlying model.
More recent work (Yazdani and Henderson, 2015;Vaswani and Sagae, 2016) has augmented locally normalized models with correctness probabilities or error states, effectively adding a step after every decision where the probability of correctness of the resulting structure is evaluated. This gives considerable gains over a locally normalized model, although performance is lower than our full globally normalized approach.

Conclusions
We presented a simple yet powerful model architecture that produces state-of-the-art results for POS tagging, dependency parsing and sentence compression. Our model combines the flexibility of transition-based algorithms and the modeling power of neural networks. Our results demonstrate that feed-forward networks without recurrence can outperform recurrent models such as LSTMs when they are trained with global normalization. We further support our empirical findings with a proof showing that global normalization helps the model overcome the label bias problem from which locally normalized models suffer.