Advances in Using Grammars with Latent Annotations for Discontinuous Parsing

We present new experiments that transfer techniques from Probabilistic Context-Free Grammars with Latent Annotations (PCFG-LA) to two grammar formalisms for discontinuous parsing: linear context-free rewriting systems and hybrid grammars. In particular, we evaluate Dirichlet priors during EM training, ensemble models, and a new nonterminal naming scheme for hybrid grammars. We find that our grammars are more accurate than previous approaches based on discontinuous grammar formalisms and early instances of discriminative models, but inferior to recent discriminative parsers.


Introduction
Many tasks in natural language processing, such as machine translation, information extraction, and sentiment analysis, benefit from syntactic analysis (Culotta and Sorensen, 2004; Ding and Palmer, 2005; Duric and Song, 2011). Often syntax is represented by means of constituents. Languages with a flexible word order such as German require constituents that are discontinuous, i.e., constituents that cover words which do not form a contiguous interval of the sentence. To this end, generalizations of context-free grammars such as tree adjoining grammars (Joshi et al., 1975) and linear context-free rewriting systems (LCFRS; Vijay-Shanker et al., 1987) have been proposed. Although the parsing complexity of these formalisms is polynomial in the length of the input sentence (for a fixed grammar), they are often considered too slow to be practically useful. Moreover, the accuracy of LCFRS-based parsers does not match that of their continuous counterparts.
Instead, a wide range of models that either apply transition systems with a reordering mechanism or are based on a dependency-to-constituency transformation have been proposed in recent years (Hall and Nivre, 2008; Versley, 2014a; Maier, 2015; Fernández-González and Martins, 2015; Coavoux and Crabbé, 2017; Corro et al., 2017; Stanojević and Garrido Alhama, 2017; Fernández-González and Gómez-Rodríguez, 2020). There are two notable exceptions: van Cranenburgh et al. (2016) consider a discontinuous extension of the data-oriented parsing approach. Gebhardt (2018) studies the extension of two grammar formalisms with latent annotations: LCFRS and hybrid grammars (Nederhof and Vogler, 2014), a synchronous grammar formalism that couples a string-generating grammar (specifically: LCFRS) and a tree-generating grammar (specifically: simple definite clause programs; Deransart and Małuszyński, 1989). In particular, Gebhardt (2018) analyses the effect of a generalization of the split/merge procedure of Petrov et al. (2006), which adaptively refines the grammar's nonterminals by automatic splitting and merging. Although this refinement strategy showed vast improvements over the respective unrefined grammars, some choices in the experimental setup of Gebhardt (2018) can be improved upon: (i) The expectation maximization algorithm (EM; Baker, 1979), which is a subroutine of the split/merge procedure, uses a likelihood-based objective that is prone to overfitting.
(ii) Ensemble models obtained by running the split/merge procedure with different random seeds have been applied successfully for continuous parsing (Petrov, 2010) but were not considered by Gebhardt (2018).
(iii) Gebhardt (2018) supposes that the initial granularity of the grammar's nonterminals matters, as the state-refinement procedure does not fully recover distinctions that are absent from them.
In this work, we address the above points by (i) comparing the likelihood-based objective to a Maximum-a-Posteriori (MAP) objective, (ii) evaluating ensemble models, and (iii) proposing a new nonterminal naming scheme for hybrid grammars. We hypothesize that these steps are complementary in improving the accuracy of the parsing model. Although we do not expect to reach or even surpass the performance of recent discriminative approaches (in particular those utilizing neural networks), we suppose that these experiments foster the understanding of the limits of the different architectures. Our implementation is available at https://github.com/kilian-gebhardt/panda-parser/.

Model
We consider models based on LCFRS and hybrid grammars; Vijay-Shanker et al. (1987) and Gebhardt et al. (2017) give formal definitions of these formalisms, respectively. Baseline grammars are induced from the training data using the induction techniques of Kallmeyer and Maier (2013) and Gebhardt et al. (2017), respectively. LCFRS are binarized either right-to-left (r2l) or head-outward (ho), with vertical and horizontal Markovization set to 1. For hybrid grammars, the induction is parametrized such that the LCFRS component has fan-out 2.
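For illustration, the following minimal sketch shows what binarization with horizontal Markovization 1 amounts to. It is a simplification under our own names and data layout, not the induction code of the paper: it treats the context-free case and right-branching factorization only, whereas an actual LCFRS binarizer additionally has to track the input spans covered by each fresh nonterminal, and the head-outward variant proceeds from the head child outward.

```python
def binarize(lhs, rhs, h=1):
    """Binarize the rule lhs -> rhs: every intermediate symbol records
    the parent and at most h of the children still to be generated."""
    rules, prev, rest = [], lhs, list(rhs)
    while len(rest) > 2:
        head, rest = rest[0], rest[1:]
        # Intermediate symbol with h siblings of horizontal context.
        new = f"{lhs}|<{','.join(rest[:h])}>"
        rules.append((prev, (head, new)))
        prev = new
    rules.append((prev, tuple(rest)))
    return rules

# binarize('S', ['A', 'B', 'C', 'D']) yields
# [('S', ('A', 'S|<B>')), ('S|<B>', ('B', 'S|<C>')), ('S|<C>', ('C', 'D'))]
```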
The baseline grammars are refined using a variant of the split/merge algorithm of Petrov et al. (2006), which is described in Gebhardt (2018). This algorithm consists of multiple split/merge cycles, in each of which the EM algorithm is used to fit the rule weights to the training data. To obtain MAP training, we modify the EM algorithm by including a Dirichlet prior (see Johnson et al., 2007, Sec. 2.3). This is implemented by incrementing each rule count by a non-negative default value in the "expectation" phase.

Gebhardt (2018) reports that LCFRS outperform hybrid grammars but conjectures that this might be an artifact of the choice of nonterminals in the baseline grammar. The nonterminal naming schemes child and strict labeling for hybrid grammars (originally introduced by Nederhof and Vogler, 2014) name a nonterminal based on the subset U of tree nodes that is generated by the nonterminal. As illustrated in fig. 1, in strict labeling the roots of the maximal subtrees formed by U are used in the label, where consecutive sibling nodes are joined; in child labeling, sequences of sibling nodes of length > 1 are replaced by "children(p)", where p is the parent node.

[Figure 1: A tree with nodes a b c d e f g, a subset of its nodes (underlined), and the nonterminals created for this subset according to different nonterminal labeling schemes: strict: b, e f g; child: b, children(a); strict-Markov: b, a e ∗.]

Gebhardt (2018) supposes that these strategies lead to a number of nonterminals that is either too small or too large for favorable automatic refinement. We explore a trade-off by Markovizing (cf. Klein and Manning, 2003) the strict labeling strategy: where strict labeling uses a sequence s = n_1 … n_k of sibling nodes with k > 1 in a nonterminal label, we now use only the parent node p of n_1 … n_k, the first node n_1, and a cut-off marker ∗, i.e., in the label s is replaced by p n_1 ∗.
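To make the Markovized strict labeling concrete, here is a minimal sketch in Python (the function name and data layout are ours, not panda-parser's): it maps the sibling sequences used by strict labeling to their Markovized counterparts.

```python
def markovized_strict_label(sequences, parent):
    """Compute a Markovized strict label from the maximal sequences
    of consecutive sibling nodes (the roots of the maximal subtrees
    formed by the node subset U).

    sequences: list of lists of node labels, e.g. [['b'], ['e', 'f', 'g']]
    parent:    maps the first node of a sequence to its parent's label
    """
    parts = []
    for seq in sequences:
        if len(seq) > 1:
            # Cut off after the first sibling; keep the parent as
            # vertical context and '*' as the cut-off marker.
            parts.append((parent(seq[0]), seq[0], '*'))
        else:
            parts.append(tuple(seq))
    return tuple(parts)

# For the subset of fig. 1, strict labeling yields "b, e f g", whereas
# markovized_strict_label([['b'], ['e', 'f', 'g']], {'e': 'a'}.get)
# yields (('b',), ('a', 'e', '*')), i.e. "b, a e *".
```

Similarly, the Dirichlet-prior variant of EM amounts to a small change in the re-estimation of rule weights. The sketch below (again with hypothetical names) adds the default count to the expected rule counts before renormalizing per left-hand side; a default count of 0 recovers plain maximum-likelihood EM.

```python
from collections import defaultdict

def map_reestimate(expected_counts, default_count=1.0):
    """MAP re-estimation of rule weights under a Dirichlet prior
    (cf. Johnson et al., 2007, Sec. 2.3).

    expected_counts: maps (lhs, rhs) rules to the expected number of
    times they were used on the training data (the E-step result).
    """
    totals = defaultdict(float)
    for (lhs, _rhs), count in expected_counts.items():
        totals[lhs] += count + default_count
    # Weights of each nonterminal's rules are normalized to sum to one.
    return {(lhs, rhs): (count + default_count) / totals[lhs]
            for (lhs, rhs), count in expected_counts.items()}
```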
After split/merge refinement, the grammars are used to parse unseen sentences. Since exact parsing is NP-hard already for PCFG-LA (see Matsuzaki et al., 2005), we follow the literature and apply an approximate tractable objective called max-rule-product (short: mrp), which projects weights from the refined grammar to a sentence-specific coarse one and computes the most probable derivation of the latter (see Petrov and Klein, 2007). Petrov (2010) presents experiments with ensemble models, where different PCFG-LA are obtained from the same training data by changing the random seed. During parsing, the mrp objective is applied, where each rule weight of the coarse grammar is set to the product of the weights that result from projecting with the individual grammars. This ensemble grammar showed significantly improved accuracy over the best single grammar. We instantiate this approach for LCFRS and hybrid grammars using 4 random seeds.
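A minimal sketch of the ensemble combination, under an assumed representation of each projection as a dict from coarse rules to weights (names are ours): the coarse rule weight is the product over the projections of the individual grammars, computed in log space for numerical stability.

```python
import math

def ensemble_projection(projections):
    """Combine the max-rule-product projections of several refined
    grammars (one per random seed) into the rule weights of a single
    sentence-specific coarse grammar.

    projections: list of dicts, each mapping a coarse-grammar rule to
    the weight projected from one refined grammar for this sentence.
    """
    combined = {}
    for rule in projections[0]:
        weights = [p.get(rule, 0.0) for p in projections]
        if all(w > 0 for w in weights):
            # Product in log space to avoid numerical underflow.
            combined[rule] = math.exp(sum(math.log(w) for w in weights))
        else:
            # A rule pruned (weight 0) in any projection stays pruned.
            combined[rule] = 0.0
    return combined
```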

Experiments
Data. We present experiments on the German TIGER (Brants et al., 2004) and NEGRA (Skut et al., 1997) corpora. For TIGER, we use the SPMRL split (Seddah et al., 2014; short: SPMRL) and the split by Hall and Nivre (2008; short: HN08) but optimize hyperparameters solely on HN08. For NEGRA we use the split by Dubey and Keller (2003). For evaluation we compute the labelled F1 and the labelled discontinuous F1 using discodop (van Cranenburgh et al., 2016; https://github.com/andreasvc/disco-dop) with the included proper.prm parameter file. In the tables we list average F1 scores and standard deviations over 4 grammars with different random seeds, except for the ensemble experiments, where we execute just one run combining those 4 grammars.

[Table 2: (Average) F1-scores on the dev. set (length ≤ 40) after training for 4 s/m cycles at 50% merge rate (HN08) and 6 s/m cycles at 80% merge rate (NEGRA). Columns labeled (ens.) show scores for the ensemble models.]
Properties of baseline grammars. We induce baseline grammars from the training data and display their properties in table 1. We see that Markovizing the strict labels effectively reduces the number of nonterminals (NTs) and leads to the most accurate baseline grammars. Still, this approach comes at the cost of reduced coverage in comparison to the child labeling hybrid grammar and the LCFRS. On NEGRA we also see an increase in parse failures in comparison to strict labeling. This is due to the addition of vertical context (i.e., the parent node) that is not present in strict labeling.
Training objective. We use the split/merge algorithm to refine the baseline grammars using both a likelihood-based and a MAP-based objective with default count 1.0 during EM training. (Next to the default count, the training has further hyperparameters, e.g., the rate of splits that is merged or the number of s/m cycles. In early experiments we found default count values around 1.0 to work best. Also, increasing the merge rate from 50% to, e.g., 80% or 90% often does not harm the accuracy and allows for smaller grammars or the execution of additional split/merge cycles. We used such an optimized setting for NEGRA and for TIGER on the test set. An exhaustive grid search that optimizes these parameters for all considered grammars and corpora is computationally expensive and beyond the scope of this article.) Differences between both training modes are displayed in table 2 for HN08 and NEGRA. MAP training improves the accuracy in many settings. In particular, for non-ensemble grammars the average F1 score on all constituents always improves. The average F1 score on only discontinuous constituents does not follow this trend in two cases. Note, however, that the prediction of discontinuous constituents seems to be comparatively unstable (cf. the high variance). This indicates that a larger sample size is needed to reliably judge the influence of MAP training on the discontinuous F1. Also, in the case of ensemble models there are two grammars for which the ensemble model trained with the MAP objective is less accurate.
Ensemble models. Comparing the (Disc.) F1 scores of the ensemble models with the averages of the individual grammars, we always see an improvement, which in many cases is also well above the standard deviation.
Selection of grammars. For HN08 we obtain the best results with hybrid grammars with the Markovized strict labeling, which also outperform LCFRS, in contrast to the experiments by Gebhardt (2018). In the experiments with NEGRA we see that the child nonterminal scheme is more accurate than the Markovized strict one. This might be explained by the smaller corpus size, which may lead to sparsity problems if the nonterminal granularity is higher. The hybrid grammar with child labeling also scores better than the LCFRS.
External comparison. Test set results are given in table 3. For NEGRA we apply the child labeling scheme and train for 7 s/m cycles using a merge rate of 80% and the MAP objective. For TIGER we apply the Markovized strict labeling scheme and train for 5 s/m cycles using a merge rate of 90% and the MAP objective. The comparison with results from the literature indicates that the ensemble of hybrid grammars performs better than other grammar-based approaches except for the pseudo-projective approach by Versley (2016). It is also more accurate than early dependency-to-constituency and transition-based approaches. However, recent models, in particular those utilizing neural networks, are far more accurate than the hybrid grammars.

Discussion and conclusions
The experiments provide further evidence that the split/merge method is applicable and effective beyond PCFG. The use of priors and of ensembles of grammars is mostly beneficial, and the two are complementary. From the performance differences between child labeling and Markovized strict labeling, we surmise that the initial nonterminal granularity matters: either the split/merge method cannot fully recover important splits, or at least the mrp objective relies during parsing on guidance by the baseline nonterminal structure. More generally, the performance differences between the considered grammars indicate that a careful choice of the grammar formalism and the extraction algorithm still matters despite split/merge refinement.
Interestingly, the pseudo-projective approach by Versley (2016) outperforms our strictly discontinuous one. He uses a linguistically motivated (de)projectivization strategy that seems to address the sparsity of discontinuous constituents in the data very well. Hence, we conjecture that truly discontinuous grammar formalisms, which make available a large number of discontinuous productions (based on scarce evidence), may rarely benefit from the additional expressiveness. Substantiating this claim certainly requires a controlled experiment, as the differences may as well be artefacts of the handling of lexical and fall-back rules by Versley (2016). However, a similar observation was made concerning (discontinuous) tree substitution grammars (van Cranenburgh et al., 2016). Also, a recent study by Corro (2020) finds that a very restricted mode of discontinuity, which can be simulated by a series of continuous combinations, is more accurate than more expressive modes.
The research on discontinuous parsing with latent variable grammars may also be extended by considering spectral algorithms (cf. Cohen, 2017, for an overview). In particular, Louis and Cohen (2015) use latently annotated LCFRS obtained by spectral algorithms to parse the topical structure of forum threads. Yet, the application of spectral algorithms to discontinuous syntactic parsing has not been investigated.