Valency-Augmented Dependency Parsing

We present a complete, automated, and efficient approach for utilizing valency analysis in making dependency parsing decisions. It includes extraction of valency patterns, a probabilistic model for tagging these patterns, and a joint decoding process that explicitly considers the number and types of each token’s syntactic dependents. On 53 treebanks representing 41 languages in the Universal Dependencies data, we find that incorporating valency information yields higher precision and F1 scores on the core arguments (subjects and complements) and functional relations (e.g., auxiliaries) that we employ for valency analysis. Precision on core arguments improves from 80.87 to 85.43. We further show that our approach can be applied to an ostensibly different formalism and dataset, Tree Adjoining Grammar as extracted from the Penn Treebank; there, we outperform the previous state-of-the-art labeled attachment score by 0.7. Finally, we explore the potential of extending valency patterns beyond their traditional domain by confirming their helpfulness in improving PP attachment decisions.


Introduction
Many dependency parsers treat attachment decisions and syntactic relation labeling as two independent tasks, despite the fact that relation labels carry important subcategorization information. For example, the number and types of the syntactic arguments that a predicate may take are rather restricted in natural languages: it is not common for an English verb to have more than one syntactic subject or more than two objects.
In this work, we present a parsing approach that explicitly models the subcategorization of (some) syntactic dependents as valency patterns (see Fig. 1), and operationalizes this notion as extracted supertags. (Our implementation is available at https://github.com/tzshi/valency-parser-emnlp18.)

Figure 1: Example UD-annotated sentence: "He says that you like to swim."

An important distinction from prior work is that our definition of valency-pattern supertags is relativized to a user-specified subset of all possible syntactic relations (see §3). We train supertaggers that assign probabilities of potential valency patterns to each token, and leverage these probabilities during decoding to guide our parsers so that they favor more linguistically plausible output structures.
We mainly focus on two subsets of relations in our analysis, those involving core arguments and those that represent functional relations, and perform experiments over a collection of 53 treebanks in 41 languages from the Universal Dependencies (UD) dataset. Our valency-aware parsers improve upon strong baseline systems in terms of output linguistic validity, measured as the accuracy of the assigned valency patterns. They also have higher precision and F1 scores on the subsets of relations under analysis, suggesting a potentially controlled way to balance precision-recall trade-offs.
We further show that our approach is not limited to a particular treebank annotation style. We apply our method to parsing another grammar formalism, Tree Adjoining Grammar, where dependency and valency also play an important role in both theory and parser evaluation. Our parser reaches a new state-of-the-art LAS of 92.59, with more than 0.6 core-argument F1-score improvement over our strong baseline parser.
Finally, we demonstrate the applicability of our valency analysis approach to other syntactic phenomena less associated with valency in its traditional linguistic sense. In a case study of PP attachment, we analyze the patterns of two syntactic relations commonly used in PP attachment, and include them in the joint decoding process. Precision of the parsers improves by an absolute 3.30% on these two relation types.

Syntactic Dependencies and Valencies
According to Nivre (2005), modern dependency grammar can be traced back to Tesnière (1959), with its roots reaching back several centuries before the Common Era. The theory is centered on the notion of dependency, an asymmetrical relation between the words of a sentence. Tesnière distinguishes three node types when analyzing simple predicates: verb equivalents that describe actions and events, noun equivalents as the arguments of the events, and adverb equivalents detailing the (temporal, spatial, etc.) circumstances. There are two types of relations: (1) verbs dominate nouns and adverbs through a dependency relation; (2) verbs and nouns are linked through a valency relation. Tesnière compares a verb to an atom: a verb can attract a certain number of arguments, just as the valency of an atom determines the number of bonds it can engage in (Ágel and Fischer, 2015). In many descriptive lexicographic works (Helbig and Schenkel, 1959; Herbst et al., 2004), valency is not limited to verbs, but also covers nouns and adjectives. For more on the linguistic theory, see Ágel et al. (2003, 2006).
Strictly following the original notion of valency requires distinguishing between arguments and adjuncts, as well as obligatory and optional dependents. However, there is a lack of consensus as to how these categorizations may be distinguished (Tutunjian and Boland, 2008), and thus we adopt a more practical definition in this paper.

Computational Representation
Formally, we fix a set of syntactic relations R, and define the valency pattern of a token w_i with respect to R as the linearly ordered sequence

    a_{-j} ... a_{-1} ⋄ a_1 ... a_k ,

where the symbol ⋄ denotes the center word w_i, and each a_l asserts the existence of a word w dominated by w_i via the relation a_l ∈ R (written w_i --a_l--> w). For a_l and a_m with l < m, the syntactic dependent for a_l linearly precedes the syntactic dependent for a_m. (Our approach, whose full description is in §5, can be adapted to cases where linear ordering is de-emphasized; the algorithm merely requires a distinction between left and right dependents. We choose to encode linearity since most languages empirically exhibit word-order preferences even when they allow relatively free word order.)

As an example, consider the UD-annotated sentence in Fig. 1. The token says has the core-relation valency pattern nsubj ⋄ ccomp, and like has the pattern nsubj ⋄ xcomp. If we consider only functional relations, both like and swim have the pattern mark ⋄. (The possibly counterintuitive direction for that and to is a result of UD's choice of a content-word-oriented design.) We sometimes employ the abbreviated notation α_L ⋄ α_R, where α indicates a sequence and the letters L and R distinguish left dependencies from right dependencies.

We make our definition of valency patterns dependent on the choice of R not only because some dependency relations are more often obligatory and closer to the original theoretical definition of valency, but also because the utility of different types of syntactic relations can depend on the downstream task. For example, purely functional dependency labels are semantically vacuous, so they are often omitted in the semantic representations extracted from dependency trees for question answering (Reddy et al., 2016). There are also recent proposals for parser evaluation that downplay the importance of functional syntactic relations (Nivre and Fang, 2017). UD core and functional relations are listed in Table 1.
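The extraction step defined above can be sketched in a few lines of Python. The tree encoding ((head, label) pairs with 1-indexed heads, 0 for the root) and the ASCII marker "<>" standing in for the center-word symbol ⋄ are our own illustrative choices, not the paper's implementation:

```python
def valency_pattern(tree, i, relations):
    """Valency pattern of token i relative to a relation subset.

    `tree` is a list of (head, label) pairs, one per token (1-indexed
    heads, 0 = root).  The pattern is the linearly ordered sequence of
    in-subset relation labels of i's dependents, with "<>" marking the
    center word.  Illustrative sketch, not the authors' implementation.
    """
    left, right = [], []
    for j, (head, label) in enumerate(tree, start=1):
        if head == i and label in relations:
            (left if j < i else right).append(label)
    return tuple(left) + ("<>",) + tuple(right)

# "He says that you like to swim ." (UD-style analysis of Fig. 1)
#  1:He 2:says 3:that 4:you 5:like 6:to 7:swim 8:.
tree = [(2, "nsubj"), (0, "root"), (5, "mark"), (5, "nsubj"),
        (2, "ccomp"), (7, "mark"), (5, "xcomp"), (2, "punct")]
CORE = {"nsubj", "obj", "iobj", "csubj", "ccomp", "xcomp"}
print(valency_pattern(tree, 2, CORE))       # says: ('nsubj', '<>', 'ccomp')
print(valency_pattern(tree, 5, CORE))       # like: ('nsubj', '<>', 'xcomp')
print(valency_pattern(tree, 5, {"mark"}))   # like: ('mark', '<>')
```

Note how restricting the relation subset changes the extracted pattern for the same token, which is exactly what makes the definition relative to R.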

Pilot Study: Sanity Checks
We consider two questions that need to be addressed at the outset:
1. How well do the extracted patterns generalize to unseen data?
2. Do state-of-the-art parsers already capture the notion of valency implicitly, though they are not explicitly optimized for it?
The first question checks the feasibility of learning valency patterns from a limited amount of data; the second probes the potential for any valencyinformed parsing approach to improve over current state-of-the-art systems.
To answer these questions, we use the UD 2.0 dataset for the CoNLL 2017 shared task (Zeman et al., 2017) and the system outputs of the top five performing submissions (Dozat et al., 2017; Shi et al., 2017b; Björkelund et al., 2017; Che et al., 2017; Lim and Poibeau, 2017). The selection of treebanks is the same as in §6. We extract valency patterns relative to the set of 6 UD core arguments given in Table 1, because they are close to the original notion of valency and we hypothesize that these patterns should exhibit few variations. This is indeed the case: the average number of valency patterns we extract is 110.4 per training treebank, with Turkish (tr) having the fewest at 34, and Galician (gl) the most at 298. We observe that, in general, languages with a higher degree of word-order flexibility tend to generate more patterns in the data, as our patterns encode linear word-order information.
Next, we extract valency patterns from the test set and compare them against those from the training set. On average, out of the 55.4 patterns observed in the gold-standard test sets, only 5.5, or 9.98%, are new and unseen with respect to training. In comparison, 36.2% of the word types appearing in the test sets are not seen during training. This suggests that the valency pattern space is relatively restricted, and the patterns extracted from training sets do generalize well to test sets.
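The generalization check above amounts to measuring the fraction of distinct gold test-set patterns absent from training. A minimal sketch, with toy pattern inventories standing in for real treebank extractions:

```python
def unseen_pattern_rate(train_patterns, test_patterns):
    """Fraction of distinct test-set valency patterns never observed in
    training.  Toy sketch of the sanity check described above."""
    train, test = set(train_patterns), set(test_patterns)
    return len(test - train) / len(test)

# Invented inventories for illustration only.
train = ["nsubj <>", "nsubj <> obj", "<> obj", "nsubj <> ccomp"]
test = ["nsubj <>", "nsubj <> obj", "nsubj <> obj obj"]
print(unseen_pattern_rate(train, test))  # 1 of 3 test patterns is unseen
```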
Finally, we consider the average number of valency patterns extracted from the top-performing system outputs and the number of those not observed in training. (We actually performed these sanity checks after the implementation and experiments of our approach, because we initially missed this idea and because the checks require access to test sets that we abstained from looking at during model development. The system outputs were retrieved from https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2424.) All five systems are remarkably "hallucinatory" in inventing valency relations, introducing 16.8 to 35.5 new valency patterns, significantly more than the actual number of unseen patterns. Below we show an error committed by the state-of-the-art Dozat et al. (2017) parser (upper half) as compared to the gold-standard annotation (lower half), highlighting the core-argument valency relations of the verb bothers in bold. The system incorrectly predicts how come to be a clausal subject.
    How come no one bothers to ask ...

Each such non-existent new pattern implies at least some (potentially small) parsing error that can contribute to the degradation of downstream task performance.

Overview
Our model is based on the following probability factorization for a given sentence x = w_1, ..., w_n and parse tree y for x:

    P(y | x) = (1 / Z_x) ∏_{i=1}^{n} P(v_i | w_i) · P(h_i | w_i) · P(r_i | w_i, w_{h_i}),

where Z_x is the normalization factor, v_i is the valency pattern extracted for w_i from y, h_i is the index of the syntactic governor of w_i, and r_i is the syntactic relation label of the dependency relation between w_{h_i} and w_i. We first assume that we have a feature extractor that associates each token in the sentence w_i with a contextualized feature vector w_i, and explain how to calculate the factored probabilities (§5.2). Then we discuss decoding (§5.3) and training (§5.4). Our decoder can be viewed as a special-case implementation of head-automaton grammars (Alshawi, 1996; Eisner and Satta, 1999). Finally, we return to the issue of feature extraction (§5.5).

Figure 2: Eisner's (1996) / Eisner and Satta's (1999) algorithm, with valency-pattern annotations, incorporated as state information, shown explicitly. We show only the R-rules; the L-rules are symmetric.
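To make the factorization concrete, the following sketch scores a candidate parse by summing the three per-token log-probability terms. The probability tables are toy stand-ins for the model's softmax outputs, and all names are ours:

```python
import math

def parse_log_prob(tree, patterns, P_val, P_head, P_label):
    """Unnormalized log-probability of a candidate parse: for each token i
    with governor h_i, relation r_i and extracted valency pattern v_i, sum
        log P(v_i | w_i) + log P(h_i | w_i) + log P(r_i | w_i, w_{h_i}).
    (The constant -log Z_x is irrelevant when comparing parses of the same
    sentence.)  Toy probability tables; an illustrative sketch only."""
    total = 0.0
    for i, (head, label) in tree.items():
        total += math.log(P_val[i][patterns[i]])       # log P(v_i | w_i)
        total += math.log(P_head[i][head])             # log P(h_i | w_i)
        total += math.log(P_label[i][(head, label)])   # log P(r_i | ...)
    return total

# Toy two-token sentence: token 1 is the root, token 2 its nsubj dependent.
tree = {1: (0, "root"), 2: (1, "nsubj")}
patterns = {1: ("<>", "nsubj"), 2: ("<>",)}
P_val = {1: {("<>", "nsubj"): 0.8}, 2: {("<>",): 0.9}}
P_head = {1: {0: 0.7}, 2: {1: 0.6}}
P_label = {1: {(0, "root"): 1.0}, 2: {(1, "nsubj"): 0.5}}
score = parse_log_prob(tree, patterns, P_val, P_head, P_label)
print(score)
```

Maximizing this sum over trees and pattern assignments is exactly what the joint decoder of §5.3 does.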

Parameterization
We parameterize P(v_i | w_i) as a softmax distribution over all candidate valency patterns:

    P(v_i | w_i) ∝ exp(score_VAL(w_i, v_i)),

where score_VAL is a multi-layer perceptron (MLP). For each word w_i, we generate a probability distribution over all potential syntactic heads in the sentence (Zhang et al., 2017). After we have selected the head of w_i to be w_{h_i}, we decide on the syntactic relation label based on another probability distribution. We use two softmax functions:

    P(h_i | w_i) ∝ exp(score_HEAD(w_{h_i}, w_i)),
    P(r_i | w_i, w_{h_i}) ∝ exp(score_LABEL(w_{h_i}, w_i, r_i)),

where both score_HEAD and score_LABEL are parameterized by deep biaffine scoring functions (Dozat and Manning, 2017).
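A biaffine scorer of the kind used for score_HEAD can be sketched as follows. This is a simplified, single-layer NumPy version; the parameter names (U, u_head, u_dep, b) are ours, and the real deep variant first passes each w_i through dedicated head/dependent MLPs:

```python
import numpy as np

def biaffine(h_head, h_dep, U, u_head, u_dep, b):
    """Biaffine score between a candidate head vector and a dependent
    vector: s = h_head @ U @ h_dep + u_head . h_head + u_dep . h_dep + b.
    Simplified sketch of a deep biaffine scorer (Dozat and Manning, 2017);
    parameter names are ours."""
    return h_head @ U @ h_dep + u_head @ h_head + u_dep @ h_dep + b

def head_distribution(H, i, params):
    """Softmax over candidate heads j for dependent i, i.e.
    P(h_i = j | w_i) proportional to exp(score_HEAD(w_j, w_i))."""
    scores = np.array([biaffine(H[j], H[i], *params) for j in range(len(H))])
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
d, n = 4, 5                               # toy dimensions
H = rng.normal(size=(n, d))               # contextualized token vectors
params = (rng.normal(size=(d, d)), rng.normal(size=d),
          rng.normal(size=d), 0.0)
p = head_distribution(H, 2, params)
print(p)                                  # a proper distribution over heads
```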

Decoding
For joint decoding, we adopt Eisner's (1996) algorithm, annotated with valency patterns as state information in the manner of Eisner and Satta (1999). The algorithm is depicted in Fig. 2. For each complete and incomplete span, visualized as triangles and trapezoids respectively, we annotate the head with its valency pattern. We adopt Earley's (1970) dot notation, •, to outward-delimit the portion of a valency pattern, starting from the center word ⋄, that has already been collected within the span. INIT generates a minimal complete span with a hypothesized valency pattern; the • is placed adjacent to ⋄.
COMB matches an incomplete span to a complete span with a compatible valency pattern, yielding a complete analysis on the relevant side of ⋄. LINK either advances the • by attaching a syntactic dependent with the corresponding relation label, or attaches a dependent with a relation label irrelevant to the current valency analysis. This algorithm can be easily extended to cases where we analyze multiple subsets of valency relations simultaneously: we just need to annotate each head with multiple layers of valency patterns, one for each subset. The time complexity of a naïve dynamic-programming implementation is O(|V|^2 |α| n^3), where |V| is the number of valency patterns and |α| is the maximum length of a valency pattern. In practice, |V| is usually larger than n, making the algorithm prohibitively slow. We thus turn to A* parsing for a more practical solution.
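The LINK rule's handling of the dot can be sketched for one side of a head's pattern: a dependent whose label is in the analyzed subset must match the next pattern symbol and advances the dot, while labels outside the subset attach freely. This is an illustrative reading of the deduction rule, not the authors' code:

```python
def link(pattern, dot, label, subset):
    """One LINK step on one side of a head's valency pattern.

    `pattern` is the sequence of analyzed relations on this side of the
    center word; `dot` counts how many have already been collected (the
    Earley-style dot).  Returns (allowed, new_dot)."""
    if label not in subset:
        return True, dot                  # irrelevant to this analysis
    if dot < len(pattern) and pattern[dot] == label:
        return True, dot + 1              # matches: advance the dot
    return False, dot                     # would violate the pattern

CORE = {"nsubj", "obj", "ccomp", "xcomp"}
print(link(("ccomp",), 0, "ccomp", CORE))   # (True, 1): dot advances
print(link(("ccomp",), 0, "obj", CORE))     # (False, 0): blocked
print(link(("ccomp",), 1, "punct", CORE))   # (True, 1): attaches freely
```

A COMB step is then only licensed when the dot has consumed the entire pattern on that side, which is what prevents the parser from "hallucinating" extra core arguments.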
A* parsing. We take inspiration from A* CCG parsing (Lewis and Steedman, 2014; Lewis et al., 2016; Yoshikawa et al., 2017). The idea (see Alg. 1) is to estimate the best compatible full parse for every chart item (in our case, complete and incomplete spans), and to expand the chart based on the estimated priority scores. Our factorization of probability scores allows the following admissible heuristic: for each span, we can optimistically estimate its best full-parse score by assigning to every token outside the span the best possible valency pattern, the best possible attachment, and the best relation label.

Algorithm 1: Agenda-based best-first parsing algorithm, adapted from Lewis et al. (2016).
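The admissible heuristic can be sketched as a sum of precomputed per-token maxima over the tokens not yet covered by a chart item; since each term upper-bounds that token's eventual contribution, the estimate never underestimates the best completion. Names are ours, in the spirit of Lewis et al. (2016):

```python
def a_star_priority(inside_score, covered, best_token_score):
    """A* priority of a chart item: its exact inside log-score plus an
    optimistic outside estimate that grants every token outside the span
    its best possible valency-pattern, attachment and label log-scores.
    `best_token_score[i]` is that precomputed per-token maximum.
    Illustrative sketch, not the authors' implementation."""
    outside = sum(s for i, s in enumerate(best_token_score)
                  if i not in covered)
    return inside_score + outside

# Per-token maxima (log-scores) for a toy 4-token sentence.
best = [-0.1, -0.7, -0.2, -0.4]
# A span covering tokens 1 and 2, with inside score -0.5:
print(a_star_priority(-0.5, {1, 2}, best))
```

Items are popped from the agenda in decreasing priority order, so the first full parse popped is guaranteed optimal under the factorized model.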

Training
We train all components jointly and optimize for the cross entropy between our model prediction and the gold standard, or, equivalently, the sum of the log-probabilities for the three distributions comprising our factorization from §5.1. This can be thought of as an instance of multi-task learning (MTL; Caruana, 1997), which has been shown to be useful in parsing (Kasai et al., 2018). To further reduce error propagation, instead of using part-of-speech tags as features, we train a tagger jointly with our main parser components (Zhang and Weiss, 2016).

Feature Extraction
We adopt bi-directional long short-term memory networks (bi-LSTMs; Hochreiter and Schmidhuber, 1997) as our feature extractors, since they have proven successful in a variety of syntactic parsing tasks (Kiperwasser and Goldberg, 2016; Cross and Huang, 2016; Stern et al., 2017; Shi et al., 2017a). As input to the bi-LSTMs, we concatenate one pre-trained word embedding, one randomly initialized word embedding, and the output of character-level LSTMs capturing sub-token-level information (Ballesteros et al., 2015). The bi-LSTM output vector at each time step then serves as that token's contextualized representation w_i.

Experiments
Data and Evaluation. Our main experiments are based on UD version 2.0, which was prepared for the CoNLL 2017 shared task (Zeman et al., 2017). We use 53 of the treebanks, across 41 languages, that have train and development splits given for the shared task. In contrast to the shared-task setting, where word and sentence segmentation are to be performed by the system, we directly use the test-set gold segmentations in order to focus on parsing; this does mean that the performance of our models cannot be directly compared to the officially reported shared-task results. For evaluation, we report unlabeled and labeled attachment scores (UAS and LAS, respectively). Further, we explicitly evaluate precision, recall and F1 scores (P/R/F) for the syntactic relations from Table 1, as well as valency pattern accuracies (VPA) involving those relations.

Implementation Details
We use three-layer bi-LSTMs with 500 hidden units (250 in each direction) for feature extraction. The valency analyzer uses a one-hidden-layer MLP with the ReLU activation function (Nair and Hinton, 2010), while the head selector and labeler use 512- and 128-dimensional biaffine scoring functions, respectively. Our models are randomly initialized (Glorot and Bengio, 2010) and optimized with AMSGrad (Reddi et al., 2018) with an initial learning rate of 0.002. We apply dropout (Srivastava et al., 2014) to our MLPs and variational dropout (Gal and Ghahramani, 2016) to our LSTMs, with a keep rate of 0.67 during training.
Efficiency. Our A* parsers are generally reasonably efficient; for the rare (< 1%) cases where the A* search does not finish within 500,000 chart-expansion steps, we back off to a model without valency analysis. When analyzing three or more relation subsets, the initialization steps become prohibitively slow due to the large number of valency-pattern combinations. Thus, we limit the number of combinations for each token to the highest-scoring 500. (We exclude the two large treebanks cs and ru_syntagrus due to experiment resource constraints; there are other Czech and Russian treebanks in our selected collection.)

Table 3: Treebank-specific F1 scores on core argument relations, comparing the baseline models to our Core MTL + joint decoding models, sorted by the error reduction (ER, %) rate. When comparing a model with performance s_2 against a baseline score s_1, ER is defined as (s_2 - s_1) / (1 - s_1). For languages with two or three treebanks, we include multiple entries differentiated by the subscripts MAX/MID/MIN, corresponding to the treebanks with the highest/median/lowest ER, respectively. A. Greek = Ancient Greek.
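The pruning step can be sketched as brute-force enumeration plus a top-k selection; real pattern inventories are far larger, and an actual implementation need not materialize every combination. The toy scores below are invented for illustration:

```python
import heapq
import itertools

def prune_combinations(layer_scores, k=500):
    """Keep only the k highest-scoring valency-pattern combinations for
    one token when several relation subsets are analyzed at once.  Each
    element of `layer_scores` maps one subset's candidate patterns to
    their log-probabilities; a combination's score is the sum across
    subsets.  Illustrative sketch of the pruning described above."""
    combos = itertools.product(*(layer.items() for layer in layer_scores))
    scored = [(sum(lp for _, lp in c), tuple(p for p, _ in c))
              for c in combos]
    return heapq.nlargest(k, scored)

# Toy log-probabilities for a core-argument layer and a functional layer.
core = {"<>": -1.2, "nsubj <>": -0.3, "nsubj <> obj": -2.0}
func = {"<>": -0.1, "mark <>": -2.5}
print(prune_combinations([core, func], k=2))
```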
Results on UD. We present our main experimental results on UD in Table 2. The baseline system does not leverage any valency information (we only train the head selectors and labelers, and use the original Eisner decoder). We compare the baseline to settings where we train the parsers jointly with our proposed valency analyzers, distinguishing the effect of using this information only at training (multi-task learning; MTL) vs. both at training and decoding. Including valency analysis in the training objective already provides a slight improvement in parsing performance, in line with the findings of Kasai et al. (2018). With our proposed joint decoding, there is a mild improvement to the overall UAS and LAS, and a higher boost to VPA. The output parse trees are now more precise in the analyzed valency relations: on core arguments, precision increases by as much as 4.56. As shown in Table 3, the performance gain of joint decoding varies across treebanks, ranging from an error reduction rate of over 30% (Dutch Lassy Small Treebank) on core argument relations to nearly 0% (Japanese). Overall, our approach exhibits a clearly positive impact on most of the treebanks in UD. We do not see performance correlating with language typology, although we do observe smaller error-reduction rates on treebanks with lower baseline performance, that is, on "harder" languages.

Parsing Tree Adjoining Grammar
Dependency and valency relations also play an important role in formalisms other than dependency grammar. In this section, we apply our proposed valency analysis to Tree Adjoining Grammar (TAG; Joshi and Schabes, 1997), because TAG derivation trees, representing the process of inserting obligatory arguments and adjoining modifiers, can be treated as a dependency representation (Rambow and Joshi, 1997). We follow prior art and use Chen's (2001) automatic conversion of the Penn Treebank (Marcus et al., 1993) into TAG derivation trees. The dataset annotation has labels 0, 1 and 2, corresponding to subject, direct object, and indirect object; we treat these as our core-argument subset in valency analysis. Additionally, we analyze CO (co-head, for phrasal verbs) as a separate singleton subset, and leave out adj (adjuncts) in defining our valency patterns. We strictly follow the experimental protocol of previous work (Bangalore et al., 2009; Chung et al., 2016; Friedman et al., 2017; Kasai et al., 2017, 2018), and report the results in Table 4. The findings are consistent with our main experiments: MTL helps parsing performance, and joint decoding further improves core-argument F1 scores, reaching a new state-of-the-art result of 92.59 LAS. The precision-recall trade-off is pronounced for the CO relation subset.

Case Study on PP Attachment
Although valency information has traditionally been used to analyze complements or core arguments, in this section we show the utility of our approach in analyzing other types of syntactic relations. We choose the long-standing problem of prepositional phrase (PP) attachment (Hindle and Rooth, 1993; Brill and Resnik, 1994; Collins and Brooks, 1995; de Kok et al., 2017), which is known to be a major source of parsing mistakes (Kummerfeld et al., 2012; Ng and Curran, 2015). In UD analysis, PPs usually bear the label obl or nmod with respect to their syntactic parents, whereas adpositions are attached via a case relation, which is included in the functional relation subset. Thus, we add another relation subset, obl and nmod, to our valency analysis. Table 5 presents the results for different combinations of valency relation subsets. We find that PP-attachment decisions are generally harder to make than core and functional relations, and including them during training distracts from the other parsing objectives (compare Core + PP with analyzing Core alone in §6). However, they do permit improvements in precision on PP attachment by 3.30, especially with our proposed joint decoding. This demonstrates the use of our algorithm outside the traditional notion of valency: it can serve as a general method for training parsers to focus on specific subsets of syntactic relations.

Constrained Dependency Grammar
Another line of research (Wang and Harper, 2004; Foth et al., 2006; Foth and Menzel, 2006; Bharati et al., 2002, 2009; Husain et al., 2011) utilizes supertags in dependency parsing within the framework of constraint dependency grammar (CDG; Maruyama, 1990; Heinecke et al., 1998). Constraints in CDG may be expressed in very general terms (and are usually hand-crafted for specific languages), so prior work in CDG involves a constraint solver that iteratively or greedily updates hypotheses without optimality guarantees. In contrast, our work focuses on a special form of constraints, the valency patterns of syntactic dependents within a subset of relations, and we provide an efficient A*-based exact decoding algorithm.

Valency in Parsing
To the best of our knowledge, there have been few attempts outside CDG to utilize lexical valency information or to improve specifically on core arguments in syntactic parsing. Øvrelid and Nivre (2007) target parsing core relations in Swedish with specifically designed features, such as animacy and definiteness, that are useful in argument realization.
Jakubíček and Kovář (2013) leverage external lexicons of verb valency frames for reranking. Mirroshandel et al. (2012, 2013) and Mirroshandel and Nasr (2016) extract selectional constraints and subcategorization frames from large unannotated corpora, and enforce them through forest reranking. Our approach does not rely on external resources or lexicons, but directly extracts valency patterns from labeled dependency parse trees. Earlier work in this spirit includes Collins (1997).

Semantic Dependency Parsing and Semantic Role Labeling
The notion of valency is also used to describe the predicate-argument structures adopted in semantic dependency parsing and semantic role labeling (Surdeanu et al., 2008; Hajič et al., 2009; Oepen et al., 2014, 2015). While semantic frames clearly have patterns, previous work (Punyakanok et al., 2008; Flanigan et al., 2014; Täckström et al., 2015; Peng et al., 2017; He et al., 2017) incorporates several types of constraints, including uniqueness and determinism constraints requiring that certain labels appear as arguments of a particular predicate only once. These systems perform inference through integer linear programming, which is usually solved approximately and cannot easily encode linear-ordering constraints on the arguments.
A* parsing. Best-first search uses a heuristic to expand the parsing chart instead of doing so exhaustively. It was first applied to PCFGs (Ratnaparkhi, 1997; Caraballo and Charniak, 1998; Sagae and Lavie, 2006), and then to dependency parsing (Sagae and Tsujii, 2007; Zhao et al., 2013; Vaswani and Sagae, 2016). A* parsing, whose heuristic must additionally be admissible, was introduced for PCFGs (Klein and Manning, 2003; Pauls and Klein, 2009), and has been widely used for grammar formalisms and parsers with large search spaces, for example CCG (Auli and Lopez, 2011) and TAG (Waszczuk et al., 2016, 2017). Our probability factorization permits a simple yet effective A* heuristic. Our decoder is similar to the supertag- and dependency-factored A* CCG parser (Yoshikawa et al., 2017), which in turn builds upon the work of Lewis and Steedman (2014) and Lewis et al. (2016); our model additionally adds syntactic relations into the probability factorization.

Conclusions
We have presented a probability factorization and decoding process that integrates valency patterns into the parsing process. The joint decoder favors syntactic analyses with higher valency-pattern supertagging probabilities. Experiments on a large set of languages from UD show that our parsers are more precise in the subset of syntactic relations chosen for valency analysis, in addition to enjoying the benefits gained from jointly training the parsers and supertaggers in a multi-task learning setting.
Our method is not limited to a particular type of treebank annotation or a fixed subset of relations. We draw similar conclusions when we parse TAG derivation trees. Most interestingly, in a case study on PP attachment, we confirm the utility of our parsers in handling syntactic relations beyond the traditional domain of valency.
A key insight of this paper that departs from prior work on automatic extraction of supertags from dependency annotations is that our definition of valency patterns is relativized to a subset of syntactic relations. This definition is closer to the linguistic notion of valency and alleviates the data sparsity problems in that the number of extracted valency patterns is small. At the same time, the patterns generalize well, and empirically, they are effective in our proposed joint decoding process.
Our findings point to a number of directions for future work. First, the choice of subsets of syntactic relations for valency analysis impacts the parsing performance on those categories; this may suggest a controllable way to address precision-recall trade-offs targeting specific relation types. Second, we experimented with a few obvious subsets of relations; characterizing which subsets benefit most from valency augmentation is an open question. Finally, our decoder builds upon projective dependency-tree decoding algorithms. In the future, we will explore the possibility of removing the projectivity constraint and the tree requirement, extending the applicability of valency patterns to other tasks such as semantic role labeling.