Head-Driven Phrase Structure Grammar Parsing on Penn Treebank

Head-driven phrase structure grammar (HPSG) enjoys a uniform formalism representing rich contextual syntactic and even semantic meanings. This paper makes the first attempt to formulate a simplified HPSG by integrating constituent and dependency formal representations into head-driven phrase structure. Two parsing algorithms are then proposed for two converted tree representations, division span and joint span, respectively. As HPSG encodes both constituent and dependency structure information, the proposed HPSG parsers may be regarded as a sort of joint decoder for both types of structures, and are thus evaluated in terms of extracted or converted constituent and dependency parse trees. Our parser achieves new state-of-the-art performance for both parsing tasks on Penn Treebank (PTB) and Chinese Penn Treebank, verifying the effectiveness of jointly learning constituent and dependency structures. In detail, we report 95.84 F1 for constituent parsing and 97.00% UAS for dependency parsing on PTB.


Introduction
Head-driven phrase structure grammar (HPSG) is a highly lexicalized, constraint-based grammar developed by Pollard and Sag (1994). As opposed to dependency grammar, HPSG is the immediate successor of generalized phrase structure grammar.
HPSG divides language symbols into categories of different types, such as vocabulary, phrases, etc. Each category has different grammar letter information. The complete language symbol, which is a complex type feature structure represented by attribute value matrices (AVMs), includes phonological, syntactic, and semantic properties, the valence of the word, and the interrelationship between various components of the phrase structure. Meanwhile, the constituent structure of HPSG follows the HEAD FEATURE PRINCIPLE (HFP) (Pollard and Sag, 1994): "the head value of any headed phrase is structure-shared with the HEAD value of the head daughter. The effect of the HFP is to guarantee that headed phrases really are projections of their head daughter" (p. 34).
Constituent and dependency are two typical syntactic structure representation forms, which have been well studied from both linguistic and computational perspectives (Chomsky, 1981; Bresnan et al., 2015). The two formalisms carry distinct information and each has its own strengths: constituent structure is better at disclosing phrasal continuity, while dependency structure is better at indicating dependency relations among words.
Typical dependency treebanks are usually converted from constituent treebanks, though they may be independently annotated as well for the same languages. Conversely, constituent parses can be accurately converted to dependency representations by grammatical rules or machine learning methods (De Marneffe et al., 2006; Ma et al., 2010). Such convertibility shows a close relation between constituent and dependency representations, which also have a strong correlation with the HFP of HPSG, as shown in Figure 1. Thus, it is possible to combine the two representation forms into a simplified HPSG, not only for even better parsing but also for a more linguistically rich representation.
In this work, we exploit the strengths of both representation forms and combine them into HPSG. To the best of our knowledge, this is the first attempt to perform such a formalization. In this paper, we explore two parsing methods for the simplified HPSG parse tree, which contains both constituent and dependency syntactic information.
Our simplified HPSG is built from the annotations or conversions of the Penn Treebank (PTB) (Marcus et al., 1993). Thus the evaluation of our HPSG parser is also done on both the annotated constituent and converted dependency parse trees, which lets our HPSG parser be compared with existing constituent and dependency parsers individually.
Our experimental results show that our HPSG parser brings better prediction on both constituent and dependency tree structures. In addition, the empirical results show that our parser reaches new state-of-the-art for both parsing tasks. To sum up, we make the following contributions:
• For the first time, we formulate a simplified HPSG by combining constituent and dependency tree structures.
• We propose two novel methods to handle the simplified HPSG parsing.
• Our model achieves state-of-the-art results on PTB and CTB for both constituent and dependency parsing.
The rest of the paper is organized as follows: Section 2 presents the tree structure of HPSG and two span representations. Section 3 presents our model based on the self-attention architecture and the adopted parsing algorithms. Section 4 reports the experiments and results on the PTB and CTB treebanks to evaluate our model. Finally, we survey related work and conclude this paper in Sections 5 and 6, respectively.
2 Simplified HPSG on PTB
Miyao et al. (2004) report the first work on semi-automatically acquiring an English HPSG grammar from the Penn Treebank. Figure 2 demonstrates an HPSG unit representation (formally called a sign), in which HEAD consists of the essential information. As the work of Miyao et al. (2004) cannot demonstrate an accurate enough HPSG from the entire source constituent treebank, we focus on the core of the HPSG sign, HEAD, which is conveniently connected with dependency grammar. For the purpose of accurate HPSG building, in this work we construct a simplified HPSG only from annotations of PTB by combining constituent and dependency parse trees.

Tree Preprocessing
In standard HPSG, according to the HFP, the HEAD value of any headed phrase is structure-shared with the HEAD value of the head daughter. In other words, a phrase in our simplified HPSG tree may be exactly the same as that in a constituent tree, and the head word of the phrase corresponds to the parent of the head words of its children in the dependency tree. For example, in the constituent tree of Figure 3(a), Federal Paper Board is a phrase (1, 3) assigned the category NP, and in the dependency tree, Board is the parent of Federal and Paper; thus in our simplified HPSG tree, the head of phrase (1, 3) is Board.
Figure 3: Constituent, dependency, and two different simplified HPSG structures of the same sentence, which is indexed from 1 to 9 with an interval range assigned to each node. Dotted boxes represent the same part. The special category # is assigned to divide a phrase with multiple heads. The division span structure adds the token H in front of the category to distinguish whether the phrase is on the left or right of the head; thus the head is the last word of the category with H, which is marked with a box. The joint span structure contains constituent phrases and dependency arcs. Categ in each node represents the category of each constituent, and HEAD indicates the head word.
Following most of the recent work, we apply the PTB-SD representation converted by version 3.3.0 of the Stanford parser. However, this dependency representation results in around 1% of phrases containing two or three head words. As shown in Figure 3(a), the phrase (5, 8), assigned the category NP, contains two head words in the dependency tree, paper and products. To deal with this problem, we introduce a special category # to divide a phrase with multiple heads, so that each phrase has only one head word. After this conversion, only 50 heads in the Penn Treebank remain erroneous.
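To make the head assignment concrete, the following minimal sketch finds the head words of a phrase from a dependency tree: a word counts as a head of the span if its parent lies outside the span. The function name `span_heads`, the list encoding of parents, and the attachment of Board to a word at position 4 are illustrative assumptions, not part of the original preprocessing code.

```python
def span_heads(parents, i, j):
    """Return the 1-based positions in span (i, j) whose parent lies outside the span.

    parents[k] is the 1-based parent position of word k (0 denotes the root);
    parents[0] is an unused placeholder so that positions stay 1-based.
    """
    return [k for k in range(i, j + 1) if not (i <= parents[k] <= j)]


# Toy fragment: "Federal Paper Board" followed by a hypothetical word at position 4.
# Board (3) is the parent of Federal (1) and Paper (2), as in the running example;
# attaching Board to position 4 and making position 4 the root are assumptions.
parents = [None, 3, 3, 4, 0]

single = span_heads(parents, 1, 3)   # exactly one head: Board
multi = span_heads(parents, 1, 2)    # two words governed from outside the span
```

A span such as `multi` with more than one entry is the situation that the special category # is introduced to handle.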

Span Representations of HPSG
Each node in the HPSG tree, denoted as an AVM, represents a compound structure. Even in our simplified HPSG, each phrase (span) should be accompanied by its head. To facilitate processing by existing parsers, we propose two ways to convert the simplified HPSG into a span-style tree structure.
Division Span. A phrase is divided into two parts corresponding to the left and right of its head. To distinguish the left and right parts, we add a special token H in front of the category to indicate the left span, in which the head of the original phrase is always the last word. Since some leaves of the tree are without a category, we explicitly use a special empty category Ø for their representation, and the token H is also applied to the empty category.
As shown in Figure 3 Board category.With this operation, head information has been encoded into span boundary of a standard constituent tree and we only need to parse such a constituent tree.
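The division step can be sketched as follows, assuming inclusive 1-based spans and a known head position. The label format `"H-" + category`, the reuse of the plain category for the right part, and the VP example are hypothetical choices for illustration; the paper only states that a token H is placed in front of the category of the left span.

```python
def divide_span(i, j, head, category):
    """Split phrase (i, j) into a left part ending at its head and a right part."""
    if not i <= head <= j:
        raise ValueError("head must lie inside the span")
    parts = [(i, head, "H-" + category)]          # the head is the last word here
    if head < j:
        parts.append((head + 1, j, category))     # words to the right of the head
    return parts


def recover_head(parts):
    """Read the head back off the right boundary of the H-marked part."""
    for start, end, label in parts:
        if label.startswith("H-"):
            return end
    raise ValueError("no H-marked part found")


np_parts = divide_span(1, 3, 3, "NP")   # Federal Paper Board, head = Board
vp_parts = divide_span(4, 8, 4, "VP")   # hypothetical phrase headed by its first word
```

Because the head is recoverable from a span boundary, a standard constituent parser over the divided tree is enough to predict it.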
Joint Span. We recursively define a structure called joint span to cover both constituent and head information. A joint span consists of all its children phrases and all dependency arcs between the heads of these children phrases. For example, the HPSG node $S_H(1, 9)$ in Figure 3(c), as a joint span, consists of its child joint spans, the category $l(1, 9)$ of the span itself, and the dependency arcs $d(r, h)$ among the heads of those children, where $l(i, j)$ denotes the category of span $(i, j)$ and $d(r, h)$ indicates the dependency between the word $r$ and its parent $h$.
Finally, following the recursive definition, the entire HPSG tree $T$, being a joint span, can be represented as $S_H(T) = S_H(1, n)$. As all constituent and head information has been formally encoded into a span-like structure, we can use a constituent-like parser for such a joint span tree.
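The recursive definition can be mirrored by a small data structure. The sketch below is illustrative only: the class name `JointSpan`, its fields, and the assumption that every non-head child attaches to the head of its parent span are ours, chosen to match the description that a joint span holds its children and the arcs between their heads.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class JointSpan:
    start: int
    end: int
    category: str
    head: int                                   # position of the head word
    children: List["JointSpan"] = field(default_factory=list)


def flatten(node: JointSpan) -> Tuple[Set[Tuple[int, int, str]], Set[Tuple[int, int]]]:
    """Collect labeled spans (i, j, category) and arcs (child, parent) under a joint span."""
    spans = {(node.start, node.end, node.category)}
    arcs: Set[Tuple[int, int]] = set()
    for child in node.children:
        if child.head != node.head:
            arcs.add((child.head, node.head))   # non-head child attaches to the head
        child_spans, child_arcs = flatten(child)
        spans |= child_spans
        arcs |= child_arcs
    return spans, arcs


# Federal (1) Paper (2) Board (3): an NP headed by Board, leaves use the empty category.
leaves = [JointSpan(k, k, "Ø", k) for k in (1, 2, 3)]
noun_phrase = JointSpan(1, 3, "NP", 3, leaves)
spans, arcs = flatten(noun_phrase)
```

Flattening the root joint span therefore yields both the constituent tree (the labeled spans) and the dependency tree (the arcs).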
3 Our Model

Overview
Using an encoder-decoder backbone, our model applies a self-attention encoder (Vaswani et al., 2017) modified by position partition (Kitaev and Klein, 2018a). Since our two converted structures of simplified HPSG are both based on phrases, we can employ a CKY-style (Cocke, 1969; Younger, 1975; Kasami, 1965) decoder for both to find the tree with the highest predicted score. The difference is that for the division span structure we only need span scores, while for the joint span structure we need both span and dependency scores.
Given a sentence $s = \{w_1, w_2, \ldots, w_n\}$, we attempt to predict a simplified HPSG tree. As shown in Figure 4, our parsing model includes four modules: token representation, self-attention encoder, scoring module, and CKY-style decoder.

Token Representation
In our model, the token representation $x_i$ is composed of character, word, and part-of-speech (POS) embeddings. For the character-level representation, we use CharLSTM (Kitaev and Klein, 2018a). For the word-level representation, we concatenate randomly initialized and pre-trained word embeddings.
Finally, we concatenate the character representation, word representation, and POS embedding as our token representation: $x_i = [x_{char}; x_{word}; x_{POS}]$.

Self-Attention Encoder
The encoder in our model is adapted from Vaswani et al. (2017) and factors explicit content and position information in the self-attention process. The input matrix $X = [x_1, x_2, \ldots, x_n]$, in which each $x_i$ is concatenated with a position embedding, is transformed by a self-attention encoder. We factor the model between content and position information both in the self-attention sub-layer and in the feed-forward network, whose settings follow Kitaev and Klein (2018a).

Decoder for Division Span HPSG
After reconstructing the HPSG tree as a constituent tree with head information as described in Section 2.2, we follow the constituent parsing approach of Kitaev and Klein (2018a) and Gaddy et al. (2018) to predict the constituent parse tree.
First, we add a special empty category Ø to spans to binarize the n-ary nodes and apply a unary atomic category to deal with the nodes of a unary chain, corresponding to nested spans with the same endpoints.
Then, we train the span scorer. The span vector $s_{ij}$ is the concatenation of vector differences of boundary representations, which are constructed by splitting in half the outputs from the self-attention encoder. We apply a one-layer feed-forward network to generate the span score vector, taking the span vector $s_{ij}$ as input:

$$S(i, j) = W_2\, g(\mathrm{LN}(W_1 s_{ij} + b_1)) + b_2,$$

where LN denotes Layer Normalization and $g$ is the Rectified Linear Unit nonlinearity. The individual score of category $\ell$ is denoted by

$$S_{categ}(i, j, \ell) = [S(i, j)]_{\ell},$$

where $[\cdot]_{\ell}$ indicates the value of the corresponding element $\ell$ of the score vector. The score $s(T)$ of the constituent parse tree $T$ is the sum of the scores of every span $(i, j)$ with category $\ell$:

$$s(T) = \sum_{(i, j, \ell) \in T} S_{categ}(i, j, \ell).$$

The goal of constituent parsing is to find the tree with the highest score: $\hat{T} = \arg\max_T s(T)$.
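As an illustration of how a tree score that decomposes over spans can be maximized, the following is a minimal CKY-style sketch over binarized trees. It is not the authors' implementation: the function name `cky_decode`, the 0-based inclusive spans, and the callback `span_score(i, j)` (assumed to return the score of the best category for the span) are all assumptions.

```python
def cky_decode(n, span_score):
    """Return (best_score, spans) of the best binary tree over words 0..n-1.

    span_score(i, j) is the score of the best category for span (i, j), inclusive.
    """
    best = {}
    back = {}
    for i in range(n):
        best[(i, i)] = span_score(i, i)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            # Choose the split point whose two sub-spans score highest together.
            k = max(range(i, j), key=lambda m: best[(i, m)] + best[(m + 1, j)])
            best[(i, j)] = span_score(i, j) + best[(i, k)] + best[(k + 1, j)]
            back[(i, j)] = k

    def collect(i, j):
        if i == j:
            return [(i, j)]
        k = back[(i, j)]
        return [(i, j)] + collect(i, k) + collect(k + 1, j)

    return best[(0, n - 1)], collect(0, n - 1)


# Toy scores for a three-word sentence; every span not listed scores 0.
toy_scores = {(0, 1): 2.0, (1, 2): 0.5}
score, tree = cky_decode(3, lambda i, j: toy_scores.get((i, j), 0.0))
```

The three nested loops over span length, start position, and split point give the cubic running time typical of this family of decoders.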
We use the CKY-style algorithm (Stern et al., 2017a; Gaddy et al., 2018) to obtain the tree $\hat{T}$ in $O(n^3)$ time. This structured prediction problem is handled by satisfying the margin constraint

$$s(T^*) \geq s(T) + \Delta(T, T^*),$$

where $T^*$ denotes the correct parse tree and $\Delta$ is the Hamming loss on category spans, with a slight modification during the dynamic programming search. The objective function is the hinge loss,

$$J_1(\theta) = \max\bigl(0, \max_T [s(T) + \Delta(T, T^*)] - s(T^*)\bigr).$$

For dependency labels, following Dozat and Manning (2017), the classifier takes the head and its children as features. We minimize the negative log probability of the correct dependency label $l_i$ for the child-parent pair $(x_i, h_i)$, implemented as a cross-entropy loss:

$$J_2(\theta) = -\log P_\theta(l_i \mid x_i, h_i).$$

Thus, the overall loss is the sum of the objectives:

$$J_{overall}(\theta) = J_1(\theta) + J_2(\theta).$$
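The margin-based objective can be illustrated on an explicit list of candidate trees. This is a simplified sketch: in practice the maximization runs inside the dynamic program rather than over an enumerated list, and the plain set-difference version of the Hamming loss below is an assumption (the paper mentions a slight modification that is not specified here). All function names are hypothetical.

```python
def hamming(tree, gold):
    """Count labeled spans of `tree` that are absent from the gold tree."""
    return len(set(tree) - set(gold))


def tree_score(tree, span_scores):
    """Sum the scores of all labeled spans in a tree."""
    return sum(span_scores.get(span, 0.0) for span in tree)


def hinge_loss(candidates, gold, span_scores):
    """max(0, max_T [s(T) + Delta(T, T*)] - s(T*)) over an explicit candidate list."""
    augmented = max(tree_score(t, span_scores) + hamming(t, gold) for t in candidates)
    return max(0.0, augmented - tree_score(gold, span_scores))


gold = [(0, 2, "S"), (0, 1, "NP")]
other = [(0, 2, "S"), (1, 2, "VP")]
span_scores = {(0, 2, "S"): 1.0, (0, 1, "NP"): 2.0, (1, 2, "VP"): 1.5}

violated = hinge_loss([gold, other], gold, span_scores)

# Raising the gold span score until the margin holds drives the loss to zero.
satisfied = hinge_loss([gold, other], gold, {**span_scores, (0, 1, "NP"): 3.0})
```

In the first case the gold tree beats the competitor by only 0.5 while the required margin is 1, so a loss of 0.5 remains.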

Decoder for Joint Span HPSG
As our joint span is defined recursively, scoring the root joint span is equivalent to scoring all spans and dependencies in the HPSG tree.
For span scores, we continue to apply the approach and hinge loss $J_1(\theta)$ of the previous section. For dependency scores, we predict a distribution over the possible heads for each word and use the biaffine attention mechanism (Dozat and Manning, 2017) to calculate the score as follows:

$$\alpha_{ij} = h_i^{\top} W g_j + U^{\top} h_i + V^{\top} g_j + b,$$

where $\alpha_{ij}$ indicates the child-parent score, $W$ denotes the weight matrix of the bilinear term, $U$ and $V$ are the weight vectors of the linear terms, $b$ is the bias term, and $h_i$ and $g_i$ are calculated by distinct one-layer perceptron networks. We minimize the negative log-likelihood of the gold dependency tree $Y$, which is implemented as a cross-entropy loss:

$$J_2(\theta) = -\bigl(\log P_\theta(h_i \mid x_i) + \log P_\theta(l_i \mid x_i, h_i)\bigr),$$

where $P_\theta(h_i \mid x_i)$ is the probability of the correct parent node $h_i$ for $x_i$, and $P_\theta(l_i \mid x_i, h_i)$ is the probability of the correct dependency label $l_i$ for the child-parent pair $(x_i, h_i)$.
To predict span and dependency scores simultaneously, we jointly train our parser to minimize the overall loss:

$$J_{overall}(\theta) = J_1(\theta) + J_2(\theta).$$

During testing, we propose a CKY-style algorithm, shown in Algorithm 1, to explicitly find the globally highest span and dependency score $S_H(T)$ of our simplified HPSG tree $T$. In order to binarize the constituent parse tree with heads, we introduce the complete span $s_c$ and the incomplete span $s_i$, similar to the Eisner algorithm (Eisner, 1996). After finding the best score $S_H(T)$, we backtrack through the chart with split point $k$ and sub-root $r$ to construct the simplified HPSG tree $T$.
Algorithm 1: Joint span parsing algorithm. Input: sentence length $n$, span and dependency scores $s(i, j, \ell)$, $d(r, h)$, $1 \leq i \leq j \leq n$, $\forall r, h$. Output: maximum value $S_H(T)$ of tree $T$.
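The biaffine child-parent score and the head-selection loss described above can be sketched in plain Python. The vectors and weights in the example are arbitrary toy values, the function names are hypothetical, and a real implementation would use batched tensor operations instead of explicit loops.

```python
import math


def biaffine(h, g, W, U, V, b):
    """Score h^T W g + U^T h + V^T g + b for one child vector h and head vector g."""
    bilinear = sum(h[a] * W[a][c] * g[c] for a in range(len(h)) for c in range(len(g)))
    linear = sum(u * x for u, x in zip(U, h)) + sum(v * x for v, x in zip(V, g))
    return bilinear + linear + b


def head_nll(scores, gold_head):
    """Negative log-probability of the gold head under a softmax over head scores."""
    peak = max(scores)                       # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_z - scores[gold_head]


alpha = biaffine(
    h=[1.0, 0.0], g=[0.0, 1.0],
    W=[[0.0, 2.0], [0.0, 0.0]], U=[1.0, 1.0], V=[0.0, 3.0], b=0.5,
)
uniform_loss = head_nll([0.0, 0.0], gold_head=0)   # two equally likely heads
```

The label loss has the same cross-entropy form, applied to a distribution over dependency labels for the chosen child-parent pair.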
Compared with the constituent parsing CKY-style algorithm (Stern et al., 2017a), the dependency score $d(r, h)$ in our algorithm affects the selection of the best split point $k$. Since we need to find the best value of sub-head $r$ and split point $k$, the complexity of the algorithm is $O(n^5)$ time and $O(n^3)$ space. To control the effect of combining span and dependency scores, we apply a weight $\lambda$:

$$s(i, j, \ell) = \lambda\, S_{categ}(i, j, \ell), \qquad d(i, j) = (1 - \lambda)\, \alpha_{ij},$$

where $S_{categ}(i, j, \ell)$ is the category score of span $(i, j)$, $\alpha_{ij}$ is the child-parent score, and $\lambda$ is in the range of 0 to 1. In addition, we can generate only the constituent or only the dependency parse tree by setting $\lambda$ to 1 or 0, respectively.
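To show where the stated $O(n^5)$ time and $O(n^3)$ space can come from, here is a minimal head-lexicalized CKY sketch whose chart is indexed by span and head word. It is not the authors' Algorithm 1, which uses complete and incomplete spans. The function name, the 0-based indexing, the omission of a root arc score, and the weighting of span scores by λ and arc scores by 1 − λ are assumptions chosen to match the limiting cases described above.

```python
import math


def joint_cky(n, span_score, dep_score, lam=0.5):
    """Best combined score of a binary headed tree over words 0..n-1.

    span_score(i, j): best category score of span (i, j), inclusive.
    dep_score(child, head): score of the arc from child to head.
    best[(i, j, r)] holds the best score of span (i, j) whose head word is r.
    """
    best = {(i, i, i): lam * span_score(i, i) for i in range(n)}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            here = lam * span_score(i, j)
            for k in range(i, j):                      # split point
                for r in range(i, k + 1):              # head of the left part
                    for q in range(k + 1, j + 1):      # head of the right part
                        parts = best[(i, k, r)] + best[(k + 1, j, q)] + here
                        # Either the left head governs the right head, or vice versa.
                        for head, child in ((r, q), (q, r)):
                            cand = parts + (1 - lam) * dep_score(child, head)
                            if cand > best.get((i, j, head), -math.inf):
                                best[(i, j, head)] = cand
    return max(best[(0, n - 1, r)] for r in range(n))


def toy_span(i, j):
    return 1.0 if (i, j) == (0, 1) else 0.0


toy_arcs = {(1, 0): 3.0, (0, 1): 1.0}


def toy_dep(child, head):
    return toy_arcs.get((child, head), 0.0)


both = joint_cky(2, toy_span, toy_dep, lam=0.5)
constituent_only = joint_cky(2, toy_span, toy_dep, lam=1.0)
dependency_only = joint_cky(2, toy_span, toy_dep, lam=0.0)
```

The five nested loops over start, end, split point, and the two sub-heads account for the quintic time, while the chart keyed by (start, end, head) accounts for the cubic space.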

Experiments
In order to evaluate the proposed model, we convert our simplified HPSG tree to constituent and dependency parse trees and evaluate on two benchmark treebanks, English Penn Treebank (PTB) and Chinese Penn Treebank (CTB5.1), following standard data splitting (Zhang and Clark, 2008; Liu and Zhang, 2017b). The placeholders with the -NONE- tag are stripped from the CTB. POS tags are predicted using the Stanford tagger (Toutanova et al., 2003), and we use the same pretagged dataset as Cross and Huang (2016) for PTB. For CTB, we use gold POS tags for dependency parsing and predicted POS tags for constituent parsing.
For constituent parsing, we use the standard evalb tool to evaluate the F1 score. For dependency parsing, following Dozat and Manning (2017), Kuncoro et al. (2016), and Ma et al. (2018), we report the results without punctuation for both treebanks.
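For reference, unlabeled and labeled attachment scores that skip punctuation can be computed as in the sketch below. The set of punctuation POS tags and the token encoding are assumptions for illustration; the exact exclusion list used in the cited evaluations is not given in this paper.

```python
def attachment_scores(gold, pred, punct_tags=frozenset({"``", "''", ",", ".", ":"})):
    """Return (UAS, LAS) in percent, skipping tokens whose POS tag is punctuation.

    gold and pred are lists of (head, label, pos_tag) triples, one per token.
    """
    total = uas = las = 0
    for (g_head, g_label, tag), (p_head, p_label, _) in zip(gold, pred):
        if tag in punct_tags:
            continue                      # punctuation is excluded from scoring
        total += 1
        if g_head == p_head:
            uas += 1
            if g_label == p_label:        # LAS also requires the correct label
                las += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * uas / total, 100.0 * las / total


gold = [(2, "nsubj", "NNP"), (0, "root", "VBZ"), (2, "dobj", "NN"), (2, "punct", ".")]
pred = [(2, "nsubj", "NNP"), (0, "root", "VBZ"), (2, "nmod", "NN"), (3, "punct", ".")]
uas, las = attachment_scores(gold, pred)
```

In this toy case the wrong attachment of the final period does not affect either score, while the wrong label on the third token lowers LAS only.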

Setup
Hyperparameters. In our experiments, we use 100D GloVe (Pennington et al., 2014) and structured-skipgram (Ling et al., 2015) pre-trained embeddings for English and Chinese, respectively. The character representations are randomly initialized, and their dimension is 64. For the self-attention encoder, we use the same hyperparameter settings as Kitaev and Klein (2018a).
For span scores, we apply feed-forward networks with a hidden size of 250. For dependency biaffine scores, we employ two 1024-dimensional MLP layers with ReLU as the activation function and a 1024-dimensional parameter matrix for biaffine attention. In addition, we augment our parser with ELMo (Peters et al., 2018), a larger version of BERT (Devlin et al., 2018), and XLNet (Yang et al., 2019) to compare with other pre-trained or ensemble models. We set 4 layers of self-attention for ELMo and 2 layers of self-attention for BERT or XLNet, as in (Kitaev and Klein, 2018a,b).
Training Details. We use 0.33 dropout for biaffine attention and MLP layers. All models are trained for up to 150 epochs with batch size 150 on a single NVIDIA GeForce GTX 1080Ti GPU with an Intel i7-7800X CPU. We use the same training settings as Kitaev and Klein (2018a) and Kitaev and Klein (2018b).

Self-attention Layers
This subsection examines the impact of different numbers of self-attention layers, varying from 8 to 16. The comparison in Table 1 indicates that the best-performing setting uses 12 self-attention layers, and that more than 12 layers bring almost no improvement or even reduce accuracy. Thus the remaining experiments are done with 12 layers of the self-attention encoder.

Moderating Constituent and Dependency
The weight parameter λ plays an important role in balancing the scoring of spans and dependencies. When λ is set to 0, only the dependency score is used to generate the dependency tree, as in general first-order dependency parsing (Eisner, 1996), while λ set to 1 yields constituent parsing only. A value of λ between 0 and 1 gives our general simplified HPSG parsing, providing both constituent and dependency structure prediction.
The comparison in Figure 5 shows that our HPSG decoder is better than either the separate constituent or the separate dependency decoder, which shows the benefit of jointly predicting constituent and dependency structures. Moreover, λ set to 0.5 achieves the best performance in terms of both F1 score and UAS.
Table 2: English dev set performance of joint span HPSG parsing. "Converted" means that the corresponding dependency parsing results are obtained from the corresponding constituent parse tree using head rules.
Figure 5: Balancing constituent and dependency of joint span HPSG parsing on English dev set.

Joint Span HPSG Parsing
We compare our joint span HPSG parser with a separately learned constituent parsing model that takes the same token representation and self-attention encoder on the PTB dev set. The constituent parsing results are also converted into dependency ones by PTB-SD for comparison. When λ is set to 0 and 1, our joint span HPSG parser works as a dependency-only parser and a constituent-only parser, respectively. Table 2 shows that even in such a working mode, our HPSG parser still outperforms the separate constituent parser in terms of both constituent and dependency parsing performance.
When λ is set to 0.5, our HPSG parser gives constituent and dependency structures at the same time, which are shown to be better than those of the stand-alone mode of either constituent or dependency parsing. Besides, the comparison also shows that the dependencies directly predicted by our model are slightly better than those converted from the predicted constituent parse trees.

Parsing Speed
We compare the parsing speed of our parser with other neural parsers in Table 3. The time complexity of our Joint span model is $O(n^5)$.
Following (Zhang and Clark, 2008; Liu and Zhang, 2017b), we report our parsing performance on both data splits.
The comparison shows that our HPSG parsing model is more effective than learning constituent or dependency parsing separately. We also find that dependency parsing benefits much more from the Joint than from the Division way, which empirically suggests that the dependency score in our joint loss is helpful.
We augment our parser with ELMo, a larger version of BERT, and XLNet as the sole token representation to compare with other models. Our Joint model in the XLNet setting even outperforms other ensemble models on both constituent and dependency parsing, achieving 96.33 F1 score, 97.20% UAS, and 95.72% LAS.
For a fair comparison with other pre-trained models on constituent parsing, we also augment our parser with a larger Chinese version of RoBERTa as the sole token representation. Our Joint model in the RoBERTa setting achieves state-of-the-art performance of 92.55 F1 score on constituent parsing.

Related Work
In earlier times, linguists and NLP researchers discussed how to encode lexical dependencies in phrase structures, as in lexicalized tree adjoining grammar (LTAG) (Schabes et al., 1988) and head-driven phrase structure grammar (HPSG) (Pollard and Sag, 1994), which is a constraint-based, highly lexicalized, non-derivational generative grammar framework.
In the past decade, many large-scale HPSG-based NLP parsing systems have been built, such as the Enju English and Chinese parser (Miyao et al., 2004; Yu et al., 2010), the Alpino parser for Dutch (Van Noord et al., 2006), and the LKB & PET (Copestake, 2002; Callmeier, 2000) for English, German, and Japanese. Meanwhile, since HPSG represents the grammar framework in a precisely constrained way, it is difficult to broadly cover unseen real-world texts in parsing. Consequently, according to Zhang and Krieger (2011), many of these large-scale grammar implementations are forced to either compromise linguistic preciseness or accept low coverage in parsing. Previous works on HPSG approximation focus on two major approaches: the grammar-based approach (Kiefer and Krieger, 2004) and the corpus-driven approach (Krieger, 2007; Zhang and Krieger, 2011), the latter of which proposes PCFG approximation as a way to alleviate some of these issues in HPSG processing.
Since constituent and dependency structures share many grammatical and machine learning characteristics, it is a natural idea to study the relationship between constituent and dependency structures, and the joint learning of constituent and dependency parsing (Collins, 1997; Charniak, 2000; Charniak and Johnson, 2005; Farkas et al., 2011; Green and Žabokrtský, 2012; Ren et al., 2013; Yoshikawa et al., 2017).
To further exploit the strengths of both representation forms, in this work we propose, for the first time, a graph-based parsing model that formulates constituent and dependency structures as a simplified HPSG.

Conclusions
This paper presents a simplified HPSG with two different decoding methods, which are evaluated on both constituent and dependency parsing. Despite the usefulness of HPSG in practice and its theoretical linguistic background, our model achieves new state-of-the-art results on both Chinese and English benchmark treebanks for both parsing tasks. Thus, this work does more than propose a high-performance parsing model by exploring the relation between constituent and dependency structures. Our experiments show that joint learning of constituent and dependency structures is indeed superior to the separate learning mode, and that combining constituent and dependency scores in joint training to parse a simplified HPSG can obtain further performance improvement.

Figure 4: The framework of our joint span HPSG parsing model.

Table 3: Parsing speed on the PTB dataset.


Table 5: Constituent parsing on PTB test set.

Table 6: Constituent parsing on CTB test set. * represents CTB dependency data splitting.