Expected F-Measure Training for Shift-Reduce Parsing with Recurrent Neural Networks

We present expected F-measure training for shift-reduce parsing with RNNs, which enables the learning of a global parsing model optimized for sentence-level F1. We apply the model to CCG parsing, where it improves over a strong greedy RNN baseline by 1.47% F1, yielding state-of-the-art results for shift-reduce CCG parsing.


Introduction
Shift-reduce parsing is a popular parsing paradigm, one reason being the potential for fast parsers based on the linear number of parsing actions needed to analyze a sentence (Nivre and Scholz, 2004; Sagae and Lavie, 2006; Zhang and Clark, 2011; Goldberg et al., 2013; Zhu et al., 2013; Xu et al., 2014). Recent work has shown that by combining distributed representations and neural network models (Chen and Manning, 2014), accurate and efficient shift-reduce parsing models can be obtained with little feature engineering, largely alleviating the feature sparsity problem of linear models.
In practice, the most common objective for optimizing neural network shift-reduce parsing models is maximum likelihood. In the greedy search setting, the log-likelihood of each target action is maximized during training, and the most likely action is committed to at each step of the parsing process during inference (Chen and Manning, 2014). In the beam search setting, Zhou et al. (2015) show that sentence-level likelihood, together with contrastive learning (Hinton, 2002), can be used to derive a global model which incorporates beam search at both training and inference time (Zhang and Clark, 2008), giving significant accuracy gains over a fully greedy model. However, despite the effectiveness of optimizing likelihood, it is often desirable to directly optimize for task-specific metrics, which often leads to higher accuracies for a variety of models and applications (Goodman, 1996; Och, 2003; Smith and Eisner, 2006; Rosti et al., 2010; Auli and Lopez, 2011; He and Deng, 2012).
In this paper, we present a global neural network parsing model, optimized for a task-specific loss based on expected F-measure. The model naturally incorporates beam search during training, and is globally optimized to learn shift-reduce action sequences that lead to parses with high expected F-scores. In contrast to Auli and Lopez (2011), who optimize a CCG parser for F-measure via softmax-margin (Gimpel and Smith, 2010), we directly optimize an expected F-measure objective, derivable from only a set of shift-reduce action sequences and sentence-level F-scores. More generally, our method can be seen as an alternative approach for training a neural beam search parsing model (Watanabe and Sumita, 2015; Weiss et al., 2015; Zhou et al., 2015), combining the benefits of global learning and task-specific optimization.
We also introduce a simple recurrent neural network (RNN) model for shift-reduce parsing, on which both the greedy baseline and the global model are based. Compared with feed-forward networks, RNNs have the potential to capture and use an unbounded history, and they have been used to learn explicit representations for parser states as well as actions performed on the stack and queue in shift-reduce parsers (Watanabe and Sumita, 2015), following Miikkulainen (1996) and Mayberry and Miikkulainen (1999). In comparison, our model is a natural extension of the feed-forward architecture in Chen and Manning (2014) using Elman RNNs (Elman, 1990).
We apply our models to CCG, and evaluate the resulting parsers on standard CCGBank data (Hockenmaier and Steedman, 2007). More specifically, by combining the global RNN parsing model with a bidirectional RNN CCG supertagger that we have developed (§4), building on the supertagger of Xu et al. (2015), we obtain accuracies higher than the shift-reduce CCG parsers of Zhang and Clark (2011) and Xu et al. (2014). Finally, although we choose to focus on shift-reduce parsing for CCG, we expect the methods to generalize to other shift-reduce parsers.

RNN Models
In this section, we start by describing the baseline model, which also serves as the pretrained model from which the global model is trained (§2.4). We abstract away from the details of CCG and present the models in a canonical shift-reduce parsing framework (Aho and Ullman, 1972), which is henceforth assumed: partially constructed derivations are maintained on a stack, and a queue stores remaining words from the input string; the initial parse item has an empty stack and no input has been consumed on the queue. Parsing proceeds by applying a sequence of shift-reduce actions to transform the input until the queue has been exhausted and no more actions can be applied.

Model
Our recurrent neural network model is a standard Elman network (Elman, 1990) which is factored into an input layer, a hidden layer with recurrent connections, and an output layer. Similar to Chen and Manning (2014), the input layer $x_t$ encodes stack and queue contexts of a parse item through concatenation of feature embeddings. The output layer $y_t$ represents a probability distribution over possible parser actions for the current item.
The current state of the hidden layer is determined by the current input and the previous hidden layer state. The weights between the layers are represented by a number of matrices: matrix $U$ contains weights between the input and hidden layers, $V$ contains weights between the hidden and output layers, and $W$ contains weights between the previous hidden layer and the current hidden layer.
The hidden and output layers at time step $t$ are computed via a series of vector-matrix products and non-linearities:

$$h_t = f(x_t U + h_{t-1} W), \qquad y_t = g(h_t V),$$

where $f$ and $g$ are sigmoid and softmax functions, respectively.
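To make the recurrence concrete, the following is a minimal sketch of one Elman time step in plain Python. It assumes the standard Elman form, in which the hidden state is a sigmoid of `x_t U + h_{t-1} W` and the output is a softmax of `h_t V`, with `x_t` treated as a row vector. The function names (`elman_step`, `vec_mat`) and the toy weights are illustrative assumptions, not the authors' implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Subtract the maximum for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def vec_mat(v, M):
    """Row vector v (length r) times matrix M (r x c), giving a length-c list."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def elman_step(x_t, h_prev, U, W, V):
    """One time step: new hidden state and a distribution over actions."""
    pre = [a + b for a, b in zip(vec_mat(x_t, U), vec_mat(h_prev, W))]
    h_t = [sigmoid(z) for z in pre]
    y_t = softmax(vec_mat(h_t, V))
    return h_t, y_t

# Toy dimensions (hypothetical): 3 input units, 2 hidden units, 4 actions.
U = [[0.1, -0.2], [0.0, 0.3], [0.5, 0.1]]
W = [[0.2, 0.0], [-0.1, 0.4]]
V = [[0.3, -0.3, 0.1, 0.0], [0.2, 0.1, -0.2, 0.4]]

h = [0.0, 0.0]
for x in ([1.0, 0.0, 0.5], [0.0, 1.0, 0.0]):
    h, y = elman_step(x, h, U, W, V)  # h carries the history forward
```

Because `h` is fed back in at every step, the distribution `y` at step t depends on all earlier inputs, which is the property that distinguishes this model from a feed-forward network.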

Feature Embeddings
Given a parse item, we first extract features using a set of predefined feature templates. Each template belongs to a feature type $f$ (such as word or POS tag), which has an associated look-up table $L_f \in \mathbb{R}^{n_f \times d_f}$ that projects a feature to its distributed representation, where $n_f$ is the vocabulary size of feature type $f$ and $d_f$ is its embedding dimension. The embedding for a concrete feature is obtained by retrieving the corresponding row from $L_f$. At time step $t$, the input layer $x_t$ is:

$$x_t = [e_{f_1,1}; \ldots; e_{f_1,|f_1|}; \ldots; e_{f_k,1}; \ldots; e_{f_k,|f_k|}],$$

where ";" denotes concatenation, $|f_k|$ is the number of feature templates for the $k$-th feature type, and $x_t \in \mathbb{R}^{1 \times (d_{f_1}|f_1| + \ldots + d_{f_k}|f_k|)}$. For each feature type, a special embedding is used for unknown features.
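The look-up and concatenation described above can be sketched as follows. This is an illustration only: the table contents, the `<unk>` key, and the function name `build_input` are assumptions made for the example, and real tables would hold learned embeddings.

```python
def build_input(features, tables, unk="<unk>"):
    """Concatenate the embeddings of all extracted features into x_t.

    features: list of (feature_type, value) pairs, one per template,
              in a fixed template order.
    tables:   dict mapping feature_type to a dict of value -> embedding.
    """
    x_t = []
    for feature_type, value in features:
        table = tables[feature_type]
        # Unknown features fall back to the special embedding of their type.
        x_t.extend(table.get(value, table[unk]))
    return x_t

# Hypothetical toy tables: 3-dimensional word and 2-dimensional POS embeddings.
tables = {
    "word": {"<unk>": [0.0, 0.0, 0.0], "saw": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]},
    "pos": {"<unk>": [0.0, 0.0], "VBD": [1.0, -1.0], "NN": [0.5, 0.5]},
}
# Two word templates and two POS templates; "cat" is not in the word table.
features = [("word", "saw"), ("word", "cat"), ("pos", "VBD"), ("pos", "NN")]
x_t = build_input(features, tables)
```

With two templates of each type, the input has dimension 3 × 2 + 2 × 2 = 10, matching the sum of $d_f |f|$ over feature types.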

Greedy Training
To train a greedy model, we extract gold-standard actions from the training data and minimize cross-entropy loss with stochastic gradient descent (SGD) using backpropagation through time (BPTT; Rumelhart et al., 1988). Similar to Chen and Manning (2014), we compute the softmax over only feasible actions at each step.
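Restricting the softmax to feasible actions can be sketched as below. The action names and raw scores are hypothetical, and the helper names are assumptions for the example; the point is only that infeasible actions receive no probability mass and that the loss is the negative log-probability of the gold action.

```python
import math

def feasible_softmax(scores, feasible):
    """Softmax over the feasible actions only.

    scores:   dict mapping action -> unnormalized output score.
    feasible: set of actions allowed in the current parse item.
    """
    m = max(scores[a] for a in feasible)
    exps = {a: math.exp(scores[a] - m) for a in feasible}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def cross_entropy(probs, gold_action):
    """Per-step loss: negative log-probability of the gold-standard action."""
    return -math.log(probs[gold_action])

# Hypothetical scores; "re" is infeasible here (e.g., fewer than two stack items).
scores = {"sh": 2.0, "re": 1.0, "un": 0.5}
probs = feasible_softmax(scores, feasible={"sh", "un"})
loss = cross_entropy(probs, "sh")
```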
Unfortunately, although we use an RNN, which keeps a representation of previous parse items in its hidden state and has the potential to capture long-term dependencies, the resulting model is still fully greedy: a locally optimal action is taken at each step given the current input $x_t$ and the previous hidden state $h_{t-1}$. Therefore, once the parser has committed to a sub-optimal action at any step, it has no means to recover and has to continue from that mistake. Such mistakes accumulate until the goal is reached, and they are referred to as search errors.
To enlarge the search space of the greedy model, and thereby alleviate some search errors, we experiment with applying beam search decoding during inference, and we observe some accuracy improvements by taking the highest-scored action sequence as the output (Table 3). However, since the greedy model itself is only optimized locally, the improvements diminish, as expected, after a certain beam size. Instead, we show below that by using the greedy model weights as a starting point, we can train a global model optimized for an expected F-measure loss, which gives further significant accuracy improvements (§5).
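The difference between greedy decoding and beam search decoding can be illustrated with a small, generic sketch. It is not the parser's decoder: the callback `next_actions` and the toy distribution `TOY` are assumptions for the example, standing in for the RNN's per-step action probabilities.

```python
import math

def beam_search(next_actions, beam_size):
    """Generic beam search over action sequences.

    next_actions(prefix) returns a dict action -> probability for the
    item reached by `prefix`, or an empty dict if the item is finished.
    Returns the highest-scoring finished (sequence, log-probability) pair.
    """
    beam = [((), 0.0)]
    finished = []
    while beam:
        candidates = []
        for prefix, score in beam:
            options = next_actions(prefix)
            if not options:
                finished.append((prefix, score))  # item is complete
                continue
            for action, prob in options.items():
                candidates.append((prefix + (action,), score + math.log(prob)))
        # Keep only the beam_size best partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_size]
    return max(finished, key=lambda item: item[1])

# Toy model (hypothetical): the locally best first action "a" leads to a
# worse complete sequence than the locally second-best action "b".
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def toy_next_actions(prefix):
    return TOY.get(prefix, {})

greedy_seq, greedy_score = beam_search(toy_next_actions, beam_size=1)
beam_seq, beam_score = beam_search(toy_next_actions, beam_size=2)
```

With a beam of size 1 the search commits to "a" and ends with probability 0.6 × 0.5 = 0.30, whereas a beam of size 2 recovers the sequence starting with "b", with probability 0.4 × 0.9 = 0.36. The model scores are unchanged in both cases, which is consistent with the observation that beam search alone gives limited gains for a locally trained model.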

Expected F1 Training
The RNN we use to train the global model has the same Elman architecture as the greedy model. Given the greedy model, we summarize its weights as $\theta = \{U, V, W\}$ and initialize the weights of the global model to $\theta$. Training then proceeds as follows:
1. We use a beam-search decoder to parse a sentence $x_n$ in the training data and let the decoder generate a k-best list of output parses using the current $\theta$, denoted as $\Lambda(x_n)$. (We do not put a limit on k: whenever an item is finished, it is appended to the k-best list. We found that the k-best lists were on average twice the size of the given beam.) Similar to other structured training approaches that use inexact beam search (Zhang and Clark, 2008; Weiss et al., 2015; Watanabe and Sumita, 2015; Zhou et al., 2015), $\Lambda(x_n)$ serves as an approximation to the set of all possible parses of an input sentence.
2. Let $y_i$ be the shift-reduce action sequence of a parse in the k-best list $\Lambda(x_n)$, let $|y_i|$ be its total number of actions, and let $y_{ij}$ be the $j$-th action in $y_i$, for $1 \le j \le |y_i|$. We compute the log-linear action sequence score of $y_i$, $\rho(y_i)$, as a sum of individual action scores in that sequence: $\rho(y_i) = \sum_{j=1}^{|y_i|} \log s_\theta(y_{ij})$, where $s_\theta(y_{ij})$ is the softmax action score of $y_{ij}$ given by the RNN model. For each $y_i$, we also compute its sentence-level F1 using the set of labeled, directed dependencies, denoted as $\Delta$, associated with its parse item. (We assume F1 over labeled, directed dependencies is also the parser evaluation metric.)
3. We compute the negative expected F1 objective (-xF1, defined below) for $x_n$ using the scores obtained in the above step and minimize this objective using SGD (maximizing the expected F1 for $x_n$).
These three steps repeat for other sentences in the training data, updating $\theta$ after processing each sentence, and training iterates in epochs until convergence.
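The two per-sequence quantities used in step 2 can be sketched as follows. The representation of a dependency as a `(head, label, dependent)` tuple and the function names are assumptions made for this illustration; the actual CCG dependency format is richer.

```python
import math

def sequence_score(action_probs):
    """rho(y_i): sum of log softmax scores of the actions in one sequence."""
    return sum(math.log(s) for s in action_probs)

def sentence_f1(predicted, gold):
    """Sentence-level F1 over sets of labeled, directed dependencies."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical dependencies as (head, label, dependent) tuples.
gold = {("saw", "subj", "John"), ("saw", "obj", "dog"),
        ("dog", "det", "the"), ("saw", "mod", "yesterday")}
pred = {("saw", "subj", "John"), ("saw", "obj", "dog"), ("dog", "mod", "the")}

rho = sequence_score([0.9, 0.8, 0.5])
f1 = sentence_f1(pred, gold)
```

In this toy case two of the three predicted dependencies are correct and the gold set has four, so precision is 2/3, recall is 1/2, and F1 is 4/7.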
We note that the above process is different from parse reranking (Collins, 2000; Charniak and Johnson, 2005), in which $\Lambda(x_n)$ would stay the same for each $x_n$ in the training data across all epochs, and a reranker is trained on all fixed $\Lambda(x_n)$. In contrast, the xF1 training procedure is on-line learning, with parameters updated after processing each sentence and each $\Lambda(x_n)$ generated with a new $\theta$.
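More formally, we define the loss $J(\theta)$, which incorporates all action scores in each action sequence, and all action sequences in $\Lambda(x_n)$, for each $x_n$ as

$$J(\theta) = -\mathrm{xF1}(\theta) = -\sum_{y_i \in \Lambda(x_n)} p(y_i|\theta)\, \mathrm{F1}(\Delta_{y_i}, \Delta^G_{x_n}), \qquad (1)$$

where $\mathrm{F1}(\Delta_{y_i}, \Delta^G_{x_n})$ is the sentence-level F1 of the parse derived by $y_i$, with respect to the gold-standard dependency structure $\Delta^G_{x_n}$ of $x_n$; $p(y_i|\theta)$ is the normalized probability score of the action sequence $y_i$, computed as

$$p(y_i|\theta) = \frac{\exp\{\rho(y_i)\}}{\sum_{y \in \Lambda(x_n)} \exp\{\rho(y)\}}.$$

To apply SGD, we derive the error gradients used for backpropagation. First, by applying the chain rule to $J(\theta)$, we have

$$\frac{\partial J(\theta)}{\partial \theta} = \sum_{y_i \in \Lambda(x_n)} \sum_{y_{ij} \in y_i} \frac{\partial J(\theta)}{\partial s_\theta(y_{ij})} \frac{\partial s_\theta(y_{ij})}{\partial \theta} = \sum_{y_i \in \Lambda(x_n)} \sum_{y_{ij} \in y_i} \delta_{y_{ij}} \frac{\partial s_\theta(y_{ij})}{\partial \theta}, \qquad (2)$$

where $\partial s_\theta(y_{ij}) / \partial \theta$ is the standard softmax gradient. Next, to compute $\delta_{y_{ij}}$, which are the error gradients propagated from the loss to the softmax layer, we rewrite the loss in (1) as

$$J(\theta) = -\frac{G(\theta)}{Z(\theta)} = -\frac{\sum_{y_i \in \Lambda(x_n)} \exp\{\rho(y_i)\}\, \mathrm{F1}(\Delta_{y_i}, \Delta^G_{x_n})}{\sum_{y \in \Lambda(x_n)} \exp\{\rho(y)\}}, \qquad (3)$$

and by simplifying:

$$\frac{\partial G(\theta)}{\partial s_\theta(y_{ij})} = \frac{1}{s_\theta(y_{ij})} \exp\{\rho(y_i)\}\, \mathrm{F1}(\Delta_{y_i}, \Delta^G_{x_n}), \qquad \frac{\partial Z(\theta)}{\partial s_\theta(y_{ij})} = \frac{1}{s_\theta(y_{ij})} \exp\{\rho(y_i)\}.$$

Finally, using (2) and (3) plus the above simplifications, the error term $\delta_{y_{ij}}$ can be derived using the quotient rule:

$$\delta_{y_{ij}} = -\frac{\partial\, \mathrm{xF1}(\theta)}{\partial s_\theta(y_{ij})} = -\frac{p(y_i|\theta)}{s_\theta(y_{ij})} \left( \mathrm{F1}(\Delta_{y_i}, \Delta^G_{x_n}) - \mathrm{xF1}(\theta) \right),$$

which has a simple closed form.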
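As a sanity check on the derivation, the sketch below computes the negative expected F1 over a toy k-best list and the error terms for every action. It assumes that the sequence probability is a softmax over the summed log action scores, so that each error term takes the form -(p(y_i) / s(y_ij)) · (F1(y_i) - xF1), and it treats each action score as an independent variable (no sharing of actions between sequences). The function name and the toy numbers are illustrative.

```python
import math

def xf1_loss_and_deltas(action_probs, f1_scores):
    """Negative expected F1 and its gradient w.r.t. each action score.

    action_probs[i][j] is the softmax score of the j-th action of sequence i;
    f1_scores[i] is the sentence-level F1 of sequence i.
    """
    rho = [sum(math.log(s) for s in seq) for seq in action_probs]
    m = max(rho)  # subtract the maximum for numerical stability
    weights = [math.exp(r - m) for r in rho]
    z = sum(weights)
    p = [w / z for w in weights]
    xf1 = sum(p_i * f_i for p_i, f_i in zip(p, f1_scores))
    deltas = [[-(p_i / s) * (f_i - xf1) for s in seq]
              for seq, p_i, f_i in zip(action_probs, p, f1_scores)]
    return -xf1, deltas

# Toy k-best list (hypothetical): three sequences of different lengths.
probs = [[0.9, 0.8], [0.5, 0.4, 0.7], [0.3]]
f1s = [0.9, 0.5, 0.2]
loss, deltas = xf1_loss_and_deltas(probs, f1s)
```

Sequences whose F1 is above the current expectation receive negative error terms, so a gradient descent step raises the scores of their actions, while sequences with below-average F1 are pushed down.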
A naive implementation of the xF1 training procedure would backpropagate the error gradients individually for each $y_i$ in $\Lambda(x_n)$. To make it efficient, we observe that the unfolded network in the beam containing all $y_i$ becomes a DAG (with one hidden state leading to one or more resulting hidden states) and apply backpropagation through structure (Goller and Kuchler, 1996) to obtain the gradients.

Shift-Reduce CCG Parsing
We explain the application of the RNN models to CCG by first describing the CCG mechanisms used in our parser, followed by details of the shift-reduce transition system.

Combinatory Categorial Grammar
A lexicon, together with a set of CCG rules, formally constitute a CCG. The former defines a mapping from words to sets of lexical categories representing syntactic types, and the latter gives schemas which dictate whether two categories can be combined. Given the lexicon and the rules, the syntactic types of complete constituents can be obtained by recursive combination of categories using the rules.
More generally, both lexical and non-lexical CCG categories can be either atomic or complex: atomic categories are categories without any slashes, and complex categories are constructed recursively from atomic ones using forward (/) and backward slashes (\) as two binary operators. As such, all categories can be represented as follows (Vijay-Shanker and Weir, 1993; Kuhlmann and Satta, 2014):

$$x := \alpha\, |_1 z_1\, |_2 z_2 \ldots |_m z_m,$$

where $m \ge 0$, $\alpha$ is an atomic category, $|_1, \ldots, |_m \in \{\backslash, /\}$ and $z_i$ are meta-variables for categories. CCG rules have the following two schematic forms, each a generalized version of functional composition (Vijay-Shanker and Weir, 1993):

$$X/Y \quad Y\, |_1 z_1 \ldots |_m z_m \;\rightarrow\; X\, |_1 z_1 \ldots |_m z_m,$$
$$Y\, |_1 z_1 \ldots |_m z_m \quad X\backslash Y \;\rightarrow\; X\, |_1 z_1 \ldots |_m z_m.$$

The first schematic form above instantiates into a forward application rule (>) for $m = 0$, and forward composition rules (> B) for $m > 0$. Similarly, the second schematic form, which is symmetric to the first, instantiates into backward application (<) and composition (< B) rules. Fig. 1 shows an example CCG derivation. All the rule instances in this derivation are instantiated from [...]
Given CCGBank (Hockenmaier and Steedman, 2007), there are two approaches to extracting a grammar from it. The first is to treat all CCG derivations as phrase-structure trees and to extract a binary, context-free "cover" grammar, consisting of all CCG rule instances in the treebank, from local trees in all the derivations (Fowler and Penn, 2010; Zhang and Clark, 2011). In contrast, one can extract the lexicon from the treebank and define only the rule schemas, without explicitly enumerating any rule instances (Hockenmaier, 2003). This is the approach taken in the C&C parser (Clark and Curran, 2007) and the one we use here. Moreover, following Zhang and Clark (2011), our CCG parsing model is also a normal-form model, which models action sequences of normal-form derivations in CCGBank.
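The forward schematic form, X/Y combined with Y followed by m further arguments, can be illustrated with a small sketch. The encoding of categories as nested tuples `(result, slash, argument)`, the function name `forward_combine`, and the `max_degree` bound on m are all assumptions made for this example; real CCG parsers also handle features, type-raising, and further restrictions that are omitted here.

```python
def forward_combine(left, right, max_degree=2):
    """Combine X/Y with Y |1 z1 ... |m zm into X |1 z1 ... |m zm.

    Atomic categories are strings; complex categories are nested tuples
    (result, slash, argument). Returns None if the rule does not apply.
    m = 0 is forward application, m > 0 is forward composition.
    """
    if not isinstance(left, tuple) or left[1] != "/":
        return None
    x, _, y = left
    peeled = []          # arguments stripped from the right category
    current = right
    for _ in range(max_degree + 1):
        if current == y:
            result = x
            # Re-attach the stripped arguments, innermost first.
            for slash, z in reversed(peeled):
                result = (result, slash, z)
            return result
        if not isinstance(current, tuple):
            return None
        current, slash, z = current
        peeled.append((slash, z))
    return None

# (S\NP)/NP applied to NP gives S\NP (forward application, m = 0).
transitive_verb = (("S", "\\", "NP"), "/", "NP")
applied = forward_combine(transitive_verb, "NP")

# A/B composed with B/C gives A/C (forward composition, m = 1).
composed = forward_combine(("A", "/", "B"), ("B", "/", "C"))
```

The backward schematic form is symmetric and could be handled by the same routine with the argument order and slash direction reversed.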
• REDUCE (re) combines the top two subtrees $s_0$ and $s_1$ on the stack using a CCG rule ($s_1\ s_0 \rightarrow x$) and replaces them with a subtree rooted in $x$. It also appends the set of newly created dependencies on $x$, denoted as $\langle x \rangle$, to $\Delta$.
• UNARY (un) applies either a type-raising or type-changing rule ($s_0 \rightarrow x$) to the stack-top element and replaces it with a unary subtree rooted in $x$.
The deduction system (Fig. 2) of our shift-reduce parser follows from the transition system. Each parse item is associated with a step indicator $\omega$, which denotes the number of actions used to build it. Given a sentence of length $n$, a full derivation requires $2n - 1 + \mu$ steps to terminate, where $\mu$ is the total number of un actions applied. In Zhang and Clark (2011), a finish action is used to indicate termination, which we do not use in our parser: an item finishes when no further action can be taken. Another difference between the transition systems is that Zhang and Clark (2011) omit the $\Delta$ field in each parse item, due to their use of a context-free, phrase-structure cover, with dependencies recovered in a post-processing step; in our system, we build dependencies as parsing proceeds.

Table 1: Atomic feature templates.
Word (w): $s_0.w$, $s_1.w$, $s_2.w$, $s_3.w$, $s.w_0$, $s.w_1$, $s.w_2$, $s.w_3$, $s_0.l.w$, $s_1.l.w$, $s_0.r.w$, $s_1.r.w$, $q_0.w$, $q_1.w$, $q_2.w$, $q_3.w$
Category (c): $s_0.c$, $s_0.l.c$, $s_0.r.c$, $s_1.c$, $s_1.l.c$, $s_1.r.c$, $s_2.c$, $s_3.c$
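The step count of 2n − 1 + µ given for the deduction system can be checked with a small simulation that tracks only the stack size and queue position. It assumes the usual behaviour of a shift action (move the next queue word onto the stack), which is an assumption of this sketch, and it ignores categories and dependencies entirely. The function name `run_actions` is hypothetical.

```python
def run_actions(n_words, actions):
    """Apply sh/re/un actions and return (stack_size, queue_pos, steps).

    Raises ValueError if an action is not applicable in the current item.
    """
    stack_size, queue_pos, steps = 0, 0, 0
    for action in actions:
        if action == "sh":
            if queue_pos >= n_words:
                raise ValueError("nothing left to shift")
            queue_pos += 1
            stack_size += 1
        elif action == "re":
            if stack_size < 2:
                raise ValueError("reduce needs two stack items")
            stack_size -= 1   # two subtrees are replaced by one
        elif action == "un":
            if stack_size < 1:
                raise ValueError("unary needs one stack item")
            # the stack-top subtree is replaced; the size is unchanged
        else:
            raise ValueError("unknown action: " + action)
        steps += 1            # the step indicator grows by one per action
    return stack_size, queue_pos, steps

# A 3-word sentence with one unary action: 3 shifts, 2 reduces, 1 unary.
n = 3
actions = ["sh", "sh", "un", "re", "sh", "re"]
stack_size, queue_pos, steps = run_actions(n, actions)
mu = actions.count("un")
```

A complete derivation leaves exactly one subtree on the stack, which takes n shifts and n − 1 reduces; each unary action adds one step without changing the stack size.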

RNN CCG Parsing
We use the same set of CCG rules as in Clark and Curran (2007), and the total number of output units in our RNN model is equal to the number of lexical categories (i.e., all possible sh actions), plus 10 units for re actions (see footnote 5) and 18 units for un actions.
All features in our model fall into three types: word, POS tag and CCG category. Table 1 shows the atomic feature templates, and we have $|f_w| = 16$, $|f_p| = 16$ and $|f_c| = 8$ (all word-based features are generalized to POS features). Each template has two parts: the first part denotes the parse item context and the second part denotes the feature type. $s$ denotes stack contexts and $q$ denotes queue contexts; e.g., $s_0$ is the top subtree on the stack, and $s_0.l$ is its left child. $w$ represents head words of constituents, and $w_0$ is the right-most word of the input string that has been shifted onto the stack.

Bidirectional Supertagging
We extend the RNN supertagging model of Xu et al. (2015) by using a bidirectional RNN (BRNN). The BRNN processes an input in both directions with two separate hidden layers, which are then fed to one output layer to make predictions. At each time step $t$, we compute the forward hidden state $\overrightarrow{h}_t$ for $t = (0, 1, \ldots, n - 1)$; the backward hidden state $\overleftarrow{h}_t$ is computed similarly but from the reverse direction for $t = (n - 1, n - 2, \ldots, 0)$ as

$$\overrightarrow{h}_t = f(x_t U + \overrightarrow{h}_{t-1} W), \qquad \overleftarrow{h}_t = f(x_t U' + \overleftarrow{h}_{t+1} W'),$$

and the output layer, for $t = (0, 1, \ldots, n - 1)$, is computed as

$$y_t = g([\overrightarrow{h}_t; \overleftarrow{h}_t]\, V').$$

The BRNN introduces two new parameter matrices $U'$ and $W'$ and replaces the old hidden-to-output matrix $V$ with $V'$ to take two hidden layers as input. We use the same three feature embedding types as Xu et al. (2015), namely word, suffix and capitalization, and all features are extracted from a context window of size 7 surrounding the current word.

Footnote 5: In principle, only 1 re unit is needed, but we use 9 additional units to handle non-standard CCG rules in the treebank.
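A bidirectional forward pass of this kind can be sketched in plain Python as follows. The sketch assumes that each direction is an Elman recurrence with its own input and recurrent matrices, and that the two hidden states at position t are concatenated before a shared softmax output layer. All names (`rnn_states`, `brnn_outputs`, `U_f`, `W_f`, `U_b`, `W_b`, `V2`) and the toy weights are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def vec_mat(v, M):
    """Row vector v (length r) times matrix M (r x c)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def rnn_states(xs, U, W, reverse=False):
    """Hidden states of one Elman RNN, read left-to-right or right-to-left."""
    h = [0.0] * len(W)
    order = reversed(range(len(xs))) if reverse else range(len(xs))
    states = [None] * len(xs)
    for t in order:
        pre = [a + b for a, b in zip(vec_mat(xs[t], U), vec_mat(h, W))]
        h = [sigmoid(z) for z in pre]
        states[t] = h
    return states

def brnn_outputs(xs, U_f, W_f, U_b, W_b, V2):
    """Tag distribution per position from concatenated forward/backward states."""
    forward = rnn_states(xs, U_f, W_f)
    backward = rnn_states(xs, U_b, W_b, reverse=True)
    return [softmax(vec_mat(f + b, V2)) for f, b in zip(forward, backward)]

# Toy sizes (hypothetical): 2 input units, 2 hidden units per direction, 3 tags.
U_f = [[0.5, -0.3], [0.2, 0.8]]
W_f = [[0.1, 0.4], [-0.2, 0.3]]
U_b = [[-0.4, 0.6], [0.7, 0.1]]
W_b = [[0.3, -0.1], [0.2, 0.5]]
V2 = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.5, 0.1, -0.3], [-0.4, 0.2, 0.1]]

sentence_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentence_b = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # differs only in the last word
out_a = brnn_outputs(sentence_a, U_f, W_f, U_b, W_b, V2)
out_b = brnn_outputs(sentence_b, U_f, W_f, U_b, W_b, V2)
```

The two toy sentences differ only in their last word, yet the prediction for the first word changes, because the backward hidden state carries right-hand context to every position.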

Experiments
Setup. All experiments were performed on CCGBank (Hockenmaier and Steedman, 2007) with the standard split. We used the C&C supertagger (Clark and Curran, 2007) and the RNN supertagger model of Xu et al. (2015) as two supertagger baselines. For the parsing experiments, the baselines were the shift-reduce CCG parsers of Zhang and Clark (2011) and Xu et al. (2014) and the C&C parser of Clark and Curran (2007).
To train the RNN parser, we used 10-fold cross validation for both POS tagging and supertagging. For both development and test parsing experiments, we used the C&C POS tagger and automatically assigned POS tags. The BRNN supertagging model was used as the supertagger by all RNN parsing models for both training and testing. F-score over directed, labeled CCG predicate-argument dependencies was used as the parser evaluation metric, obtained using the script from C&C.

Hyperparameters.
For the BRNN supertagging model, we used the same hyperparameter settings as Xu et al. (2015). For all RNN parsing models, the weights were uniformly initialized in the interval [−2.0, 2.0] and scaled by their fan-in (Bengio, 2012); the hidden layer size was 220; 50-dimensional embeddings were used for all feature types; and scaled Turian embeddings (Turian et al., 2010) were used for word embeddings. We also pretrained CCG lexical category and POS embeddings using the GENSIM word2vec implementation. The data used for this was obtained by parsing a Wikipedia dump using the C&C parser and concatenating the output with CCGBank Sections 02-21. Embeddings for unknown words and CCG categories outside of the lexical category set were uniformly initialized ([−2.0, 2.0]) without scaling.
To train all the models, we used a fixed learning rate of 0.0025 and did not truncate the gradients for BPTT, except for training the greedy RNN parsing model, where we used a BPTT step size of 9. We applied dropout at the input layer (Legrand and Collobert, 2015), with a dropout rate of 0.25 for the supertagger and 0.30 for the parser.

Supertagging Results
Table 2 shows 1-best supertagging results. The MaxEnt C&C supertagger uses POS tag features and a tag dictionary, neither of which is used by the RNN supertaggers. For all supertaggers, the same set of 425 lexical categories is used (Clark and Curran, 2007). On the test set, our BRNN supertagger achieves a 1-best accuracy of 93.52%, an absolute improvement of 0.52% over the RNN model, demonstrating the usefulness of contextual information from both input directions.
Fig. 3a compares the multi-tagging accuracy of the three supertaggers as the variable-width beam probability cut-off value β is varied for each supertagger. The β value determines the average number of supertags (ambiguity) assigned to each word by pruning supertags whose probabilities are not within β times the probability of the 1-best supertag. For this experiment we used β values ranging from 0.09 to 2 × 10^−4, and it can be seen that the BRNN supertagger consistently achieves better accuracies at similar ambiguity levels.
Finally, all shift-reduce CCG parsers mentioned in this paper take multi-tagging output obtained with a fixed β for training and testing, and in general a smaller β value can be used by a shift-reduce CCG parser than by the C&C parser. This is because a β value that is too small may explode the dynamic program of the C&C parser, which therefore relies on an adaptive supertagging strategy (Clark and Curran, 2007): it starts from a large β value and backs off to smaller values if no spanning analysis can be found with the current β.

Table 3: The effect on dev F1 by varying the beam size and supertagger β value for the greedy RNN model.
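The β cut-off used for multi-tagging can be sketched as follows: a supertag is kept for a word if its probability is at least β times the probability of that word's 1-best supertag. The function name `multitag` and the toy distribution are assumptions made for this illustration.

```python
def multitag(tag_probs, beta):
    """Return the supertags whose probability is within beta of the best one.

    tag_probs: dict mapping supertag -> probability for a single word.
    beta:      cut-off in (0, 1]; smaller values keep more supertags.
    """
    best = max(tag_probs.values())
    return {tag for tag, p in tag_probs.items() if p >= beta * best}

# Hypothetical supertag distribution for one word.
tag_probs = {"N": 0.60, "N/N": 0.30, "NP": 0.08, "PP": 0.02}

narrow = multitag(tag_probs, beta=0.6)      # low ambiguity
medium = multitag(tag_probs, beta=0.09)     # threshold is 0.054
wide = multitag(tag_probs, beta=0.00025)    # high ambiguity
```

Lowering β can only add supertags, so the average number of supertags per word grows as β shrinks.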

Parsing Results
To pretrain the greedy model, we trained 10 cross-validated BRNN supertagging models to supply supertags for the parsing model, and used a supertagger β value of 0.00025, which gave on average 5.02 supertags per word. We ran SGD training for 60 epochs, observing no accuracy gains after that, and the best greedy model was obtained after the 52nd epoch (Fig. 3b).
Furthermore, we found that using a relatively smaller supertagger β value (higher ambiguity) for training, and a larger β value (lower ambiguity) for testing, resulted in more accurate models; and we chose the final β value used for the greedy model to be 0.09 using the dev set (Table 3). This observation differs from Zhang and Clark (2011) and Xu et al. (2014), which are two shift-reduce CCG parsers using the averaged perceptron and beam search (Collins, 2002; Collins and Roark, 2004; Zhang and Clark, 2008): they used the same β values for training and testing, which resulted in lower accuracy for our greedy model. Table 3 also shows the effect on dev F1 of using different beam sizes at test time for the greedy model: with b = 6, we obtained an accuracy of 85.02%, an improvement of 0.41% over b = 1 (with a β value of 0.09); we saw accuracy gains up to b = 8 (with only minimal gains at b = 16 for β values 0.06 and 0.07), after which the accuracy started to drop. F1 on dev with b = 6 across all training epochs is shown in Fig. 3b as well, and the best model was obtained after the 43rd epoch.
For the xF1 model, we used b = 8 and a supertagger β value of 0.09 for both training and testing. [...] higher than that of the greedy model with b = 1 and 0.71% higher than the greedy model with b ∈ {6, 8}. This result improves over the shift-reduce CCG models of Zhang and Clark (2011) and Xu et al. (2014) by 0.73% and 0.55%, respectively (Table 4).

Table 4: Final parsing results on Section 00 and Section 23 (100% coverage). Zhang and Clark (2011)* is a reimplementation of the original. All speed results (sents/sec) are obtained using Section 23 and precomputation is used for all RNN parsers. LP (labeled precision); LR (labeled recall); LF (labeled F-score over CCG dependencies); CAT (lexical category assignment accuracy). All experiments use auto POS.

Table 4 summarizes the final results (see footnote 8). RNN-xF1, the xF1 trained beam-search model, is currently the most accurate shift-reduce CCG parser, achieving a final F-score of 86.42%, and gives an F-score improvement of 1.47% over the greedy RNN baseline. We show the results for the model of Xu et al. (2014) for reference only, since it uses a more sophisticated dependency model rather than a normal-form derivation model.
At test time, we also used the precomputation trick of Devlin et al. (2014) to speed up the RNN models by caching the top 20K word embeddings and all POS embeddings, and this made the greedy RNN parser more than 3 times faster than the C&C parser (all speed experiments were measured on a workstation with an Intel Core i7 4.0GHz CPU).

Footnote 8: The C&C parser fails to produce spanning analyses for a very small number of sentences (Clark and Curran, 2007) on both dev and test sets, which is not the case for any of the shift-reduce parsers; for brevity, we omit C&C coverage results.

Related Work
Optimizing for Task-specific Metrics. Our training objective is largely inspired by task-specific optimization for parsing and MT. Goodman (1996) proposed algorithms for optimizing a parser for various constituent matching criteria, and it is one of the earliest works that we are aware of on optimizing a parser for evaluation metrics. Smith and Eisner (2006) proposed a framework for minimizing expected loss for log-linear models and applied it to dependency parsing by optimizing for labeled attachment scores, although they obtained little performance improvement. Auli and Lopez (2011) optimized the C&C parser for F-measure. However, they used the softmax-margin (Gimpel and Smith, 2010) objective, which required decomposing precision and recall statistics over parse forests. Instead, we directly optimize for an F-measure loss. In MT, task-specific optimization has also received much attention (e.g., see Och (2003)). Closely related to our work, Gao and He (2013) proposed training a Markov random field translation model as an additional component in a log-linear phrase-based translation system using a k-best list based expected BLEU objective; using the same objective, other work has trained a large-scale phrase-based reordering model and an RNN language model, respectively, all as additional components within a log-linear translation model. In contrast, our RNN parsing model is trained in an end-to-end fashion with an expected F-measure loss, and all parameters of the model are optimized using backpropagation and SGD.
Parsing with RNNs. A line of work is devoted to parsing with RNN models, including using RNNs (Miikkulainen, 1996; Mayberry and Miikkulainen, 1999; Legrand and Collobert, 2015; Watanabe and Sumita, 2015) and LSTM (Hochreiter and Schmidhuber, 1997) RNNs (Kiperwasser and Goldberg, 2016). Legrand and Collobert (2015) used RNNs to learn conditional distributions over syntactic rules; sequence-to-sequence learning (Sutskever et al., 2014) has been explored for parsing; character-level representations have also been utilized; and Kiperwasser and Goldberg (2016) built an easy-first dependency parser using tree-structured compositional LSTMs. However, all these parsers use greedy search and are trained using the maximum likelihood criterion (except Kiperwasser and Goldberg (2016), who used a margin-based objective). For learning global models, Watanabe and Sumita (2015) used a margin-based objective, which was not optimized for the evaluation metric; although not using RNNs, Weiss et al. (2015) proposed a method using the averaged perceptron with beam search (Collins, 2002; Collins and Roark, 2004; Zhang and Clark, 2008), which required fixing the neural network representations, and thus their model parameters were not learned using end-to-end backpropagation.
Finally, a number of recent works (Bengio et al., 2015; Vaswani and Sagae, 2016) have explored training neural network models for parsing and other tasks such that the network learns from the oracle as well as its own predictions, and is hence more robust to search errors during inference. In principle, these techniques are largely orthogonal to both global learning and task-based optimization, and we would expect further accuracy gains to be possible by combining these techniques in a single model.

Conclusion
Neural network shift-reduce parsers are often trained by maximizing likelihood, which does not optimize towards the final evaluation metric. In this paper, we addressed this problem by developing expected F-measure training for an RNN shift-reduce parsing model. We have demonstrated the effectiveness of our method on shift-reduce parsing for CCG, achieving higher accuracies than all shift-reduce CCG parsers to date and the de facto C&C parser. We expect the general framework will be applicable to models using other types of neural networks such as feed-forward or LSTM nets, and to shift-reduce parsers for constituent and dependency parsing.