Association Metrics in Neural Transition-Based Dependency Parsing

Lexical preferences encoded as association metrics have been shown to improve performance on structural ambiguities that are still challenging for modern parsers. This paper introduces a mechanism to include lexical preferences in a neural transition-based dependency parser for German. We compare pointwise mutual information (PMI) and embedding-based scores. Both the PMI-based model and the embedding-based model outperform the baseline significantly. The best model is PMI-based and increases overall performance by 0.26 LAS points over the baseline.


Introduction
Structural ambiguities that cannot be resolved purely on the basis of structural preferences still pose a major challenge to syntactic parsing. Prepositional phrase (PP) attachment and subject-object inversion are two examples of such ambiguities. Table 1 gives an overview of the most frequent parser errors in a German newspaper corpus of 20K sentences and 350K tokens, parsed by the De Kok and Hinrichs (2016) parser with 92.01 labeled attachment score. It shows that more than one third of all errors involve prepositions, subjects and accusative objects.

Resolving such ambiguities often requires context information or world knowledge. In Example 1, the direct object Problem 'problem' is fronted. The parser, however, learns from the training data a preference for the unmarked word order with a sentence-initial subject. Problem would therefore be misclassified as subject. Additionally, both Problem and Post 'post' are ambiguous between nominative and accusative case. Information on the sentence level thus does not suffice to decide on the correct attachment. Contextual knowledge reveals that Problem typically attaches to lösen 'to solve' as direct object.
Semantic preferences can provide further disambiguation cues. The verb lösen prefers an animate subject and an inanimate direct object. In Example 1, both Problem and Post are inanimate. World knowledge is necessary to interpret Post as the (animate) group of postal employees. Such knowledge can be learned from large corpora. Semantic preferences then yield the correct analysis of Post as animate subject and Problem as inanimate direct object of lösen. Pointwise mutual information (PMI, Fano (1961)) has been used to measure selectional preferences (Church and Hanks, 1990). PMI indicates how much more often two words occur together than expected by chance. In the example above, a high PMI of lösen and Problem in verb → direct object relations would already provide enough information to resolve the subject-object ambiguity. As PMIs are ideally calculated from large corpora, they provide additional context information.
In more traditional analyses of dependency distributions, it has been shown that PMI is very beneficial for resolving structural ambiguities such as PP attachment (Hindle and Rooth, 1993; Ratnaparkhi, 1998; Volk, 2002). In parsing, bilexical preferences have been used by Van Noord (2007) to improve syntactic ambiguity resolution in a maximum-entropy parser for Dutch. Kiperwasser and Goldberg (2015) extended bilexical preferences to contextual association scores based on PMI and dependency embeddings (Levy and Goldberg, 2014a) in a graph-based parser. Mirroshandel and Nasr (2016) integrated selectional preferences into a graph-based dependency parser.
Recent approaches to neural dependency parsing (Chen and Manning, 2014; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2017) implicitly encode information about co-occurrences through vector representations of the token input (Mikolov et al., 2013). However, De Kok et al. (2017) have shown for PP attachment that neural models can still benefit from information provided by PMI scores.
This paper argues that bilexical preferences are also useful in neural transition-based dependency parsing. The two main contributions are 1) a methodology for applying bilexical preferences to neural transition-based dependency parsing, and 2) an evaluation of two types of association metrics in a neural dependency parser. Results confirm that association metrics benefit neural dependency parsing. The best association score models outperform the baseline by up to 0.26 LAS points and improve performance on two ambiguity resolution tasks by up to 2.33 points.

Approach
Transition-based dependency parsing is the task of establishing dependency relations between tokens (Kübler et al., 2009). Typically, unprocessed tokens are put on a buffer β, and a stack σ keeps track of the partially processed tokens. In the transition system used in this work, sometimes called the stack-projective system, attachments are made between the token on top of the stack and the second token on the stack (Nivre, 2004). A LEFTARC transition attaches the second token on the stack as a dependent of the token on top of the stack with relation r ∈ R, and vice versa for a RIGHTARC transition.

Association scores can inform a parser about whether an attachment with a particular dependency relation should be made between two attachment sites. For each parser state, two attachments are possible with any of the dependency relations that are available in that system.1 Association scores for all possible attachments provide disambiguation cues at each state. They are added to the feature vector that is used as input to the transition classifier. Association score vectors enhance existing vector representations of words, part-of-speech tags, characters, dependency relations and morphological features.
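The stack-projective transitions described above can be sketched as follows. This is a minimal, hypothetical illustration of the transition system, not the De Kok and Hinrichs (2016) implementation; tokens are represented as integer indices.

```python
# Minimal sketch of the stack-projective transition system:
# LEFTARC attaches the second token on the stack as a dependent of the
# top token; RIGHTARC does the reverse. The dependent is removed in
# both cases. SHIFT moves the next buffer token onto the stack.

def leftarc(stack, arcs, r):
    """Attach s1 (second on stack) as an r-dependent of s0 (top)."""
    s0, s1 = stack[-1], stack[-2]
    arcs.append((s0, r, s1))  # (head, relation, dependent)
    del stack[-2]

def rightarc(stack, arcs, r):
    """Attach s0 (top of stack) as an r-dependent of s1 (second)."""
    s0, s1 = stack[-1], stack[-2]
    arcs.append((s1, r, s0))
    del stack[-1]

def shift(stack, buffer):
    """Move the next token from the buffer onto the stack."""
    stack.append(buffer.pop(0))

stack, buffer, arcs = [], [0, 1, 2], []
shift(stack, buffer)
shift(stack, buffer)
leftarc(stack, arcs, "subject")
```

After the two SHIFTs the stack holds tokens 0 and 1; the LEFTARC then makes token 0 a subject-dependent of token 1 and pops it from the stack.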

Parser Integration
For each parser state, association scores are retrieved for LEFTARC and RIGHTARC transitions, and for all possible dependency relations. Equation 1 defines the association score vector for a stack-projective transition system with transitions between the token on top of the stack s0 and the second token on the stack s1:

(1) v_assoc = [assoc(s0, s1, r1), ..., assoc(s0, s1, r|R|), assoc(s1, s0, r1), ..., assoc(s1, s0, r|R|)] with R = {r1, ..., r|R|}

Example 2 provides the resulting association score vector in a stack-projective system with a dependency relation set that contains the subject, object and preposition relations.
(2) v_assoc = [assoc(s0, s1, subject), assoc(s0, s1, object), assoc(s0, s1, preposition), assoc(s1, s0, subject), assoc(s1, s0, object), assoc(s1, s0, preposition)] with R = {subject, object, preposition}

If no association score is available for a dependency triple, a default is assigned. An optional binary indicator b ∈ {0, 1} specifies whether the dependency triple was known. This makes it possible for the model to distinguish between the default value and association strengths that overlap with the default value. The binary indicators are added to the association score vector. The association score vectors are concatenated with the remaining input feature vectors to represent a parser configuration.2
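The construction of this vector can be sketched as below. The score table and word forms are invented examples; the real parser retrieves scores from the PMI table or dependency embeddings described later.

```python
# Sketch of the association score vector (Example 2) with optional
# binary indicators marking whether a triple was known. The toy score
# table and German word forms are illustrative assumptions.

R = ["subject", "object", "preposition"]
DEFAULT = 0.0

# toy lookup: (head, dependent, relation) -> association score
scores = {("löst", "Post", "subject"): 0.4,
          ("löst", "Problem", "object"): 0.7}

def assoc_vector(s0, s1, with_indicator=True):
    """Scores for both attachment directions and all relations in R."""
    v = []
    for head, dep in ((s0, s1), (s1, s0)):
        for r in R:
            known = (head, dep, r) in scores
            v.append(scores.get((head, dep, r), DEFAULT))
            if with_indicator:
                v.append(1.0 if known else 0.0)
    return v

vec = assoc_vector("löst", "Problem")
```

With indicators, the vector has 2 × |R| × 2 = 12 entries; only the (löst, Problem, object) slot carries a non-default score here.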

Association Metric Variants
Pointwise mutual information. Traditionally, PMI has been a means to capture bilexical preferences. Normalized PMI (NPMI, Bouma (2009)) and positive normalized PMI (PNPMI, Van de Cruys (2011)) with add-1 Laplace smoothing have been tested in the parsing model. Given the dependency triple h →r d, consisting of the head h, dependent d and dependency relation r, PMI is defined as:

PMI(h →r d) = log( p(h →r d) / (p(h, r) · p(d, r)) )

where p(h, r) and p(d, r) represent the probabilities of h and d occurring as head and dependent with relation r. NPMI, which divides the PMI by −log p(h →r d), is a more easily interpretable variant of PMI, limiting the range of PMIs to lie between -1 and 1. Positive PMI PMI_pos(h →r d) rounds negative PMIs to 0.

Dependency embedding scores. PMI is likely to suffer from sparseness of dependency triples in the training data. Previous attempts have used back-off models (Collins and Brooks, 1995) to counteract this problem. The dependency embedding model by Levy and Goldberg (2014a) estimates probabilities for unseen triples h →r d from word embeddings: the model predicts the probability p(1 | h →r d) of a dependency triple, with word embeddings trained jointly with the classifier. An embedding-based association score for the head word embedding W_h and the context embedding C_(d,r) of the dependent d that is related to a head h via the dependency relation r can be formulated as:

assoc(h, r, d) = σ(W_h · C_(d,r))

where C ∈ R^(|V|×r×d) and W ∈ R^(|V|×d). In the current model, the maximum-entropy probability of 0.5 is assigned as a default when no embedding for h, d or both is available and no score can be calculated. Further model variations also include a binary indicator to distinguish the default score from a calculated embedding-based score. In a more fine-grained binary indicator model, the indicator informs the parser for which of the two tokens no embedding was available. Levy and Goldberg (2014b) have shown that the skip-gram model is an implicit factorization of the shifted PMI matrix of word co-occurrences. Dependency embeddings (Levy and Goldberg, 2014a) therefore implicitly factorize the shifted PMI matrix of head-dependent co-occurrences. Hence, association scores based on dependency embeddings (Kiperwasser and Goldberg, 2015) can be seen as correlated with PMIs.
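The two metric families can be illustrated with toy counts and vectors. This is a minimal sketch; the exact smoothing of the probability estimates is an assumption, as are all function names and values.

```python
import math

# Illustrative computation of the association metrics above:
# NPMI and PNPMI over triple counts (with add-1 smoothing as a simple
# stand-in for the paper's Laplace smoothing), and the sigmoid
# embedding-based score. All counts and vectors are toy values.

def npmi(count_hrd, count_hr, count_dr, total, smoothing=1):
    """Normalized PMI of a dependency triple h -r-> d, in [-1, 1]."""
    p_hrd = (count_hrd + smoothing) / (total + smoothing)
    p_hr = (count_hr + smoothing) / (total + smoothing)
    p_dr = (count_dr + smoothing) / (total + smoothing)
    pmi = math.log(p_hrd / (p_hr * p_dr))
    return pmi / -math.log(p_hrd)  # normalization step

def pnpmi(*args, **kwargs):
    """Positive normalized PMI: negative scores are rounded to 0."""
    return max(0.0, npmi(*args, **kwargs))

def embedding_score(w_h, c_dr):
    """Sigmoid of the dot product of the head embedding and the
    relation-typed context embedding of the dependent."""
    dot = sum(a * b for a, b in zip(w_h, c_dr))
    return 1.0 / (1.0 + math.exp(-dot))
```

A frequent triple yields a positive NPMI, an unseen one a negative NPMI that PNPMI clips to 0; with orthogonal (or missing) embeddings the sigmoid score degenerates to the maximum-entropy default of 0.5.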

Experimental Setup
The neural transition-based dependency parser of De Kok and Hinrichs (2016) serves as the baseline for the experiments. Words, part-of-speech tags and characters are represented as vectors that were trained with structured skip-gram (Ling et al., 2015). Topological fields are used as additional input features. The parser does pseudo-projective parsing (Nivre and Nilsson, 2005) and was trained on the shuffled TüBa-D/Z (Telljohann et al., 2017), which contains 105K sentences and 1.9M tokens of manually labeled data from the Berliner Tageszeitung (taz). Non-gold part-of-speech tags were produced via 10-fold jackknifing on the TüBa-D/Z.3 The data was split in a 7:1:2 ratio for training, development and testing, respectively. Association scores are retrieved for lowercased word forms to increase lexical coverage. Common and proper nouns are typically capitalized in German and were therefore not lowercased.
Results are presented as labeled (LAS) and unlabeled attachment scores (UAS), including punctuation. Accuracies for inversion and prepositions indicate performance on resolving ambiguities. Inversion accuracy reports the correct labeling of subjects and objects in clauses with a fronted object. Preposition accuracy comprises all correct heads and labels of prepositional phrases and objects. The test set contains 1,887 cases of inversion (5.82 percent of all clauses) and 31,687 prepositional phrases and objects.

PMIs in Neural Dependency Parsing
A table of PMIs was generated for dependency triples h →r d from the German newspaper taz (393.7M tokens, 22.8M sentences) and a dump of the German Wikipedia from January 2018 (803.5M tokens, 39.9M sentences), two subcorpora of the TüBa-D/DP treebank (De Kok and Pütz, 2019) parsed by the De Kok and Hinrichs (2016) parser without association scores. All dependency triples not contained in the table are mapped to the most neutral value of 0. The PMI table is generated once in linear time. The same holds for the dependency embeddings described in Section 3.3. Each association score retrieval is then done in constant time, so that the linear time property of parsing remains unchanged. PMI models with minimum dependency triple frequencies of 5, 50 and 100 have been trained with both NPMI and PNPMI scores. NPMI models have been tested with and without the binary indicator. Results for the PMI models are given in Table 2.
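The constant-time retrieval described above amounts to a hash-map lookup with a neutral fallback. A minimal sketch, with invented triples and an assumed relation label:

```python
# Sketch of constant-time PMI retrieval: the table is built once,
# and unseen triples fall back to the neutral default of 0.0, so
# parsing stays linear in sentence length. Entries are toy values.

pmi_table = {
    ("lösen", "Problem", "obja"): 0.62,   # hypothetical triples
    ("trinken", "Milch", "obja"): 0.55,
}

def lookup(head, dep, rel, default=0.0):
    """O(1) retrieval of an association score for a dependency triple."""
    return pmi_table.get((head, dep, rel), default)
```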

The best PMI model uses normalized PMI with a minimum frequency of 5. The model outperforms the baseline by 0.26 LAS points, which is significant in the Wilcoxon test (Dror et al., 2018) with p < 5.24 × 10⁻¹⁰. It also improves the LAS by 0.03 points over the best embedding-based model, but the improvement is not statistically significant. Larger improvements can be seen for both sorts of ambiguity. The best model increases inversion LAS by 1.54 points and preposition LAS by 0.98 points over the baseline.

Dependency Embedding Scores in Neural Dependency Parsing
For the embedding-based model, dependency embeddings with 300 dimensions were trained with the algorithm from Levy and Goldberg (2014a).4 Different embeddings have been trained on pseudo-projectivized and non-projective versions of taz, Wikipedia, and the German Europarl (1.25B tokens and 42.1M sentences in total). The number of dependency relations varies from 38 non-projective to 212 pseudo-projective relations.
All embedding variants have been trained on regular head-dependent and inverse dependent-head relations. A fully typed model was trained on contexts that include the token typed per dependency relation. A second, semi-typed model includes the token without dependency relations as context. For both models, variants with and without a binary indicator have been evaluated. The binary model uses a simple binary indicator which labels association scores as default or as being calculated from dependency embeddings. A more fine-grained triple-binary model for fully typed embeddings evaluates the following three conditions to true or false: 1) the head word embedding could be retrieved from the focus matrix, 2) the dependent word embedding, i.e. the combination of the context token and the dependency relation, could be retrieved from the context matrix, 3) an embedding for the context token could be retrieved from the focus word matrix, indicating whether there exists a word embedding for the token at all. The double-binary model for semi-typed embeddings indicates whether an embedding has been found for the focus and the context token. As the context token is not typed for dependencies in the semi-typed model, the context matrix contains entries for tokens without the different dependency relations they occur with.

Results for parsing with association scores based on dependency embeddings are shown in Table 3. The overall best embedding-based model uses projectivized, fully typed embeddings with a binary indicator. The model outperforms the baseline parser by 0.23 LAS points, significant in the Wilcoxon test (p < 1.94 × 10⁻⁷), and remains 0.03 points below the best PMI model. Embedding-based models are only superior to PMI models when it comes to inversion LAS. There, the best embedding-based model improves by 2.33 points over the baseline, compared to a 1.54-point improvement of the best PMI model.
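The triple-binary indicator for fully typed embeddings can be sketched as three membership tests against the focus and context matrices. The matrix contents and word forms below are toy assumptions:

```python
# Sketch of the triple-binary indicator for fully typed embeddings:
# three flags record whether 1) the head embedding exists in the focus
# matrix, 2) the relation-typed dependent embedding exists in the
# context matrix, and 3) the dependent token has any embedding at all.

W = {"isst": [0.1, 0.3], "Spaghetti": [0.2, 0.1]}  # focus matrix (toy)
C = {("Spaghetti", "obj"): [0.4, 0.2]}             # typed context matrix (toy)

def triple_binary(head, dep, rel):
    return [1.0 if head in W else 0.0,         # 1) head embedding found
            1.0 if (dep, rel) in C else 0.0,   # 2) typed dependent found
            1.0 if dep in W else 0.0]          # 3) token embedding exists

flags = triple_binary("isst", "Spaghetti", "obj")
```

The double-binary variant for semi-typed embeddings would keep only the first and third flags, since its context matrix is not typed for relations.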

Evaluation
Both the PMI-based and embedding-based models perform better than the baseline. The overall improvement stems from a larger number of correctly resolved ambiguous attachments. Lexical associations between more than two tokens may be necessary to further improve ambiguity resolution. For PP attachment, the compatibility between the preposition, its modifier noun and the verbal or nominal head candidate of the PP has to be modeled. De Kok et al. (2017) have shown that trilexical preferences help to better capture attachment preferences of the preposition.
It can also be beneficial to make competing attachment sites available to the parser. Currently, association scores are only computed for the two attachment candidates of any given parser state. With beam search, several attachment candidates can compete in different analyses. The best candidate can then be chosen from all or the n best candidates (Zhang and Clark, 2008; Andor et al., 2016).

Ambiguity Resolution with Association Metrics
Most parser errors still involve a limited number of dependency relations, as shown in Table 1. Errors in PP attachment, subjects and objects can often be traced back to problems with resolving ambiguities. An evaluation of association scores for particular word pairs can show whether such scores can be useful in parsing ambiguous sentences. Table 4 lists PMI- and embedding-based scores for selected word pairs and dependency relations. Random pairs that are common in everyday language are distinguished from pairs that occur in subject-object inversion and have been incorrectly attached by the (best-performing embedding-based) parser. PMIs have been retrieved from the positive normalized PMI table with minimum frequency 5. Embedding-based scores were calculated from projectivized, fully typed dependency embeddings.

As Table 4 shows, problems of data sparsity can indeed be alleviated by using embedding-based rather than PMI-based scores. In spite of a low frequency threshold of 5, the PMI table is very sparse compared to the embedding-based scores. However, when a PMI is available, the scores indicate the correct tendency in the majority of cases. Considering that all unknown values are equal to the default PMI of 0.0, the tendencies are correct for e.g. trinkt 'drinks', which prefers to attach Mann 'man' as the subject and Milch 'milk' as the direct object. The tendencies of embedding-based scores are mostly correct, such as the preference of Spaghetti 'spaghetti' to attach to isst 'eats' as a direct object. The wider lexical coverage of embedding-based models may not lead to any gains over PMI-based models, partially due to the architecture of the neural dependency parser, which already encodes information about co-occurrences in the distributional representations of the input tokens.

Conclusion
This paper presented a technique to include association metrics in a neural transition-based dependency parser for German. PMI-based and embedding-based association scores have been tested. Both PMI-based and embedding-based models significantly outperform the baseline. In spite of the wider lexical coverage of embedding-based models, PMI models achieve accuracies on a par with embedding-based models.
A qualitative analysis revealed that association scores in part provide useful disambiguation cues to the parser. Follow-up experiments in other languages with relatively free word order and moderately complex morphology will further investigate the effect of association metrics on neural transition-based dependency parsing. Due to its similarity to German, Dutch will be the first language to be examined. Trilexical rather than bilexical preferences could further improve results. Keeping more competing attachment candidates through beam search is another promising direction for future work. As an alternative to association scores, a compatibility model that is directly integrated into the parser could be considered.
Example 1 (English gloss): '... Federal Post Office still has to solve this problem.'

Table 1 :
Five most frequent parser errors by dependency label of the parser by De Kok and Hinrichs (2016) for a German newspaper corpus. More than one third of all errors involve prepositions, subjects and accusative objects.

Table 2 :
Parser accuracy (overall, inversion, preposition attachment) for neural dependency parsing with PMI-based association scores. The NPMI model with minimum frequency 5 achieves the best overall performance.

Table 3 :
Parser accuracy (overall, inversion, preposition attachment) for neural dependency parsing with embedding-based association scores. The overall best model uses projectivized, fully typed dependency embeddings with a binary indicator.

Table 4 :
PMI- and embedding-based scores for random and incorrectly attached dependency triples.