Corpus-Driven Thematic Hierarchy Induction

Thematic role hierarchy is a widely used linguistic tool to describe interactions between semantic roles and their syntactic realizations. Despite decades of dedicated research and numerous thematic hierarchy suggestions in the literature, this concept has not been used in NLP so far due to incompatibility and limited scope of existing hierarchies. We introduce an empirical framework for thematic hierarchy induction and evaluate several role ranking strategies on English and German full-text corpus data. We hypothesize that global thematic hierarchy induction is feasible, that a hierarchy can be induced from just fractions of training data and that resulting hierarchies apply cross-lingually. We evaluate these assumptions empirically.


Introduction
Semantic roles are one of the core concepts in NLP, and automatic semantic role labeling (SRL) is a major task with applications in question answering (Shen and Lapata, 2007), machine translation (Liu and Gildea, 2010) and information extraction (Christensen et al., 2010). The goal of SRL is to label the semantic arguments of a predicate (e.g. a verb) with roles from a pre-defined role inventory. Conceptually, role assignment in SRL can be split into two steps: local labeling estimates the likelihood of a certain semantic argument bearing a certain role; global optimization takes context-dependent role interactions into account and enforces certain theoretically motivated constraints (e.g. "each role must appear only once per predication").
The state of the art in SRL is held by systems based on deep neural networks (Marcheggiani and Titov, 2017; He et al., 2017). While achieving remarkable quality on benchmark datasets, modern systems show a considerable ≈10-point performance drop when applied out-of-domain. This issue is aggravated by the fact that deep neural networks require significant amounts of training data, and SRL annotations are expensive to produce. While local role assignment can be augmented using unlabeled data (e.g. via pre-trained word and character embeddings), context-dependent role interaction is an SRL-specific phenomenon and can only be learned from annotated SRL corpora.
Aiming to reduce the training data requirements for SRL, we revisit the notion of thematic hierarchy (TH), a compact delexicalized way to model context-dependent role interactions. Thematic hierarchies assume that given a syntactic hierarchy (e.g. subject ≺ object ≺ oblique) semantic roles can be ranked in a way that higher ranked roles take higher ranked syntactic positions. One example of phenomena captured by THs is the choice of subject: given a thematic hierarchy Agent ≺ ... ≺ Instrument, an Instrument can only become subject if the Agent is not present, e.g. "[John]_Ag broke the window with a [hammer]_In" → "A [hammer]_In broke the window".
THs have received considerable attention in linguistic literature, but were so far impractical for use in NLP and SRL due to incompatibility and limited scope of the existing hierarchies. As a first step towards including THs into the NLP tool inventory we suggest an empirical framework for inducing THs from role-annotated corpora. Since VerbNet (Schuler, 2006) is the only SRL framework that operates with thematic roles, we choose it as our basis and perform experiments on the PropBank corpus (Palmer et al., 2005) enriched with VerbNet role labels via SemLink (Bonial et al., 2013).
The contributions of this paper are as follows:
• We suggest a method for global thematic hierarchy induction from corpus data;
• We propose several thematic and syntactic ranking models and evaluate them on English and German data;
• We show that thematic hierarchies can be induced and applied cross-lingually while leaving room for improvement; we further show that thematic hierarchy induction is data-efficient and can produce a high-quality hierarchy using just a fraction of training data.

Related work

Semantic roles and the Lexicon
Semantic roles in the modern sense were introduced in the 1960s as a way to account for variation in the syntactic behavior of verbs which cannot be explained by purely syntactic means (Gruber, 1965; Fillmore, 1968). A commonly used motivational example contrasts the use of verbs hit and break: while both are regular transitive verbs, hit does not allow construction (4); and construction (5) is ungrammatical in both cases.
There exist several principled ways to describe the syntactic behavior of arguments in the lexicon. Available constructions can be defined individually on verb sense basis. This strategy is precise but highly redundant, since verbs show substantial similarities in syntactic behavior; besides, it does not generalize to out-of-vocabulary (OOV) predicates.
A step towards a more general representation is verb class grouping (Levin, 1993): verb senses can be grouped into verb classes with syntactic behavior shared among the members of the class. For example, syntactically break behaves like crash, shred and split, while hit behaves like bash and whack in the corresponding verb senses. This significantly reduces the lexicon redundancy and allows treatment of OOV verbs if the verb class can be determined. A similar level of granularity is used by the major SRL frameworks: FrameNet SRL (Das et al., 2010) and, to some extent, PropBank SRL (Roth and Woodsend, 2014).
Semantic arguments share similarities across verb classes, giving rise to the notion of general semantic roles. While there exists no consensus on the inventory of semantic roles, a subset shared by most theoretical approaches includes roles such as Agent (the active sentient initiator of the event), Theme (the most affected participant), Result (the outcome of the event), Instrument (the instrument used) etc. Semantic roles show similar behavior across languages and can be thought of as grammatically relevant universal categories humans use to conceptualize real-world events. Following common terminology, we further refer to general, predicate-independent semantic roles as thematic roles. This level of granularity is, for example, used by VerbNet (Schuler, 2006).
Thematic roles' syntactic behavior depends on the presence of other thematic roles in the sentence: as our example above demonstrates, an Instrument can only take the subject position if the Agent is not present (3); and Theme can only become subject if both Agent and Instrument are not expressed (4-5). A widely used modeling tool to account for context dependency is the thematic hierarchy (TH): given a syntactic prominence scale (e.g. subject ≺ oblique... ≺ object), one can assume that there exists a universal ranking of thematic roles, which is homomorphic to the syntactic ranking (e.g. Agent ≺ Instrument ≺ Theme). The top-ranking semantic argument gets assigned to the highest available syntactic position, the second-ranking gets the second-highest position, etc.
THs are a compact delexicalized way to describe semantic roles' syntactic behavior at the grammar level, which could reduce data requirements and improve generalization capability of SRL systems. However, THs from the literature come from varying theoretical backgrounds, are based on different syntactic formalisms and operate with different role inventories. Most of these THs are justified via basic (often synthetic) language examples, aiming to verify a certain theory cross-lingually rather than to describe the language use in a compact way.

Major SRL Frameworks
The choice of linguistic theory in SRL is mostly dictated by the availability of training data. PropBank SRL is based on the PropBank corpus (Palmer et al., 2005) which utilizes a set of predicate-specific core roles (A0-5) and a set of general, predicate-independent adjunct roles (AM-TMP, AM-LOC etc.). Core roles are defined on verb sense level. An effort is made to ensure consistency in assigning A0 (Agent-like) and A1 (Patient-like). The rest of the core arguments (A2-5) are verb sense-specific; no finer-grained distinctions between roles are made.
PropBank annotation is closely tied to syntax. FrameNet (Baker et al., 1998) takes a different stance and focuses on accurate and detailed representation of event semantics. Verbs (as well as lexemes from other categories) are grouped into frames so that members of the same frame share a set of fine-grained frame-specific semantic roles (e.g. Impactee, Force, Buyer, Goods).
Both PropBank and FrameNet SRL operate on the verb sense/verb class generalization level. VerbNet (Schuler, 2006) groups verbs into Levin-inspired verb classes and defines sets of general, lexicon-level thematic roles and constructions for each class. It is the only SRL formalism that operates with a thematic role set. VerbNet role sets and verb class information are mapped to the PropBank corpus annotations via SemLink (Bonial et al., 2013).

Thematic roles in SRL
So far only a few studies have considered VerbNet-level granularity in SRL, and we are not aware of SRL systems specifically designed to exploit the thematic role generalizations. Zapirain et al. (2008) compare PropBank and VerbNet performance using a simple SRL system and conclude that PropBank labels generally perform better; however, they do not use any additional modeling possibilities offered by VerbNet's general, predicate-independent role set. Loper et al. (2007) show that replacing verb-specific PropBank roles A2-5 with the corresponding VerbNet roles improves the SRL performance. Merlo and van der Plas (2009) report a statistical analysis of PropBank and VerbNet annotations and conclude that while the PropBank role inventory correlates better with syntax and is therefore easier to learn, VerbNet thematic roles are more informative and generalize better to new verb instances. Finally, a recent comparison on German data by Hartmann et al. (2017) positions the VerbNet inventory above FrameNet and below PropBank in terms of complexity and generalization capabilities; however, the experiment is again based on the mateplus system (Roth and Woodsend, 2014) designed with the PropBank generalization level in mind.

Semantic Proto-Roles
A related line of work is Semantic Proto-Role Labeling (SPRL) (Reisinger et al., 2015; White et al., 2017) which, following Dowty (1991), discards the notion of atomic semantic role inventory and replaces it with Proto-Agent and Proto-Patient property sets. While our study utilizes traditional atomic role inventories, we see SPRL as a compatible parallel line of work and believe that additional benefits can be gained by combining the two views on the syntax-semantics interface. In particular, Reisinger et al. (2015) investigate the alignment between Dowty-style role properties and VerbNet thematic roles and show that VerbNet Agents tend to bear Dowty's instigated, awareness and volitional properties, while Themes are more likely to change possession, change state, etc.

Thematic hierarchies
Numerous THs have been proposed in the linguistic literature, e.g. Agent ≺ Instrument ≺ Theme (Fillmore, 1968); see (Levin and Rappaport Hovav, 2005) for an overview. These hierarchies are rarely applicable for NLP since they originate from different theoretical backgrounds and are usually focused on a narrow set of linguistic phenomena (e.g. subject selection), aiming to provide a cross-linguistically valid hierarchy based on a set of manually constructed examples. In contrast, our approach is data-driven and aims to describe the general syntactic behavior of thematic roles. While an optimal TH that would successfully describe semantic roles' behavior across languages might not exist (and would imply the existence of a universal role inventory and grammar), our evidence suggests that this concept is at least partially applicable.
To the best of our knowledge, there exists no prior work explicitly aiming at discovering thematic hierarchies in corpora. However, the hierarchy-related effects are reported in some studies. For example, White et al. (2017) observe on a reduced role set that VerbNet roles disprefer the violations of thematic/syntactic hierarchy alignment. Sun et al. (2009) experiment on thematic rank prediction for PropBank A0 and A1, but extend their analysis neither to VerbNet thematic roles, nor to the PropBank A2-5.

Syntactic formalisms
Cross-lingual applicability has traditionally been a strong component in semantic role theory, and universality is one of the common desiderata for a thematic hierarchy. This, however, implies the existence of a universal syntactic prominence scale.
From the NLP perspective, the closest to a universal syntactic representation for which automatic parsers are available is the Universal Dependencies (UD) representation. Universal Dependencies (Nivre et al., 2016) is a recent initiative aimed at creating a single dependency-based formalism suited for describing syntactic structure in a language-independent way. It encompasses freely available treebanks for more than 60 languages, and universal dependency parsing is an active research area (Zeman et al., 2017). Based on that, we make an effort to ground our study in UD syntax for English. Since neither gold UD annotations nor a deterministic converter are available for German, we use the TIGER dependency syntax representation (Dipper et al., 2001) for that language.

Hierarchical Linking model

Model
We suggest a simple model to describe the interface between syntactic and thematic rankings. An SRL corpus can be seen as a collection of sentences with corresponding predications, where each predication has a target (e.g. verb) and a set of arguments labeled with semantic roles.
Let $a_1 \dots a_n \in A$ be the set of arguments in the predication $p$; $r(a_i)$ be the role label of the argument $a_i$, and $d(a_i)$ be the path between the predicate and the argument in the dependency parse tree of the sentence. A syntactic ranker $S$ provides a syntactic rank $s_i = S(d(a_i))$ for each argument $a_i$ in $A$ based on the path, and a thematic ranker $T$ provides a thematic rank $t_i = T(r(a_i))$ based on the argument's role. For each pair of arguments $(a_i, a_j)$ we expect their syntactic ranks to align with their thematic ranks, i.e. the argument with the higher thematic rank should also take the higher syntactic rank.
The model per se does not imply the existence of a global ranking and allows flexible ranker definition. It allows ties in both syntactic and thematic rankings.
We use accuracy to assess how well a given syntactic-semantic ranker pair reflects the actual argument ranks found in data. Given a set of test predications $p_1, p_2 \dots p_k \in P$ with the argument sets $A_1, A_2 \dots A_k$, we measure the correspondence between syntactic and semantic ranking over the argument pairs $(a^k_i, a^k_j)$ via accuracy, defined as the share of argument pairs whose syntactic and thematic ranks align. To avoid the majority class bias, we measure accuracy for each role pair and use macro-averaged accuracy over pairs as the final score. A straightforward alternative to our evaluation metric would be the Kendall rank correlation coefficient, which, based on our preliminary experiments, tends to overemphasize the performance on the most frequent role pairs.
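
The evaluation described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function name `macro_pair_accuracy`, the integer rank encoding (smaller value = more prominent), the toy predications and the treatment of ties (a tie in one ranking aligns only with a tie in the other) are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def macro_pair_accuracy(predications, syn_rank, them_rank):
    """Alignment accuracy per role pair, macro-averaged over role pairs.

    predications: list of predications, each a list of (role, relation) tuples.
    syn_rank / them_rank: functions mapping a relation / a role to an integer
    rank (smaller value = more prominent).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for args in predications:
        for (r1, d1), (r2, d2) in combinations(args, 2):
            if r1 == r2:
                continue  # identical roles do not form a role pair
            pair = tuple(sorted((r1, r2)))
            totals[pair] += 1
            # Assumption: a pair is aligned when both rankers order the two
            # arguments the same way (a tie only matches a tie).
            syn_order = sign(syn_rank(d1) - syn_rank(d2))
            them_order = sign(them_rank(r1) - them_rank(r2))
            if syn_order == them_order:
                hits[pair] += 1
    per_pair = {pair: hits[pair] / totals[pair] for pair in totals}
    return sum(per_pair.values()) / len(per_pair), per_pair


# Hypothetical toy data using the relation labels of the running example.
SYN = {"subj": 0, "iobj": 1, "nmod": 2, "obj": 3}
THEM = {"Ag": 0, "In": 1, "Th": 2}
data = [
    [("Ag", "subj"), ("Th", "obj"), ("In", "nmod")],
    [("In", "subj"), ("Th", "obj")],
    [("Th", "subj"), ("In", "nmod")],  # Theme outranks Instrument here
]
score, per_pair = macro_pair_accuracy(
    data, lambda rel: SYN.get(rel, 4), lambda role: THEM[role]
)
```

On this toy input the Instrument/Theme pair is aligned in two of its three occurrences, so the macro average is lower than the per-pair scores of the two perfectly aligned pairs, while a micro average over all five argument pairs would weight the frequent pair more heavily.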

Thematic Hierarchy Induction
This paper investigates several thematic ranking strategies. As a running example we use a small role set: Agent (Ag), Patient (Pa), Instrument (In), Theme (Th) and Value (Va). For now we assume the following syntactic hierarchy: subj ≺ iobj ≺ nmod ≺ obj ≺ other.

Local ranker
The simplest way to model role ranking is to extract the average syntactic rank for each role based on the data, and then, given a test pair, assign ranks based on average syntactic rank.

Pairwise ranker
Given that roles often strongly prefer a certain syntactic position (see also White et al., 2016), local ranking is a reasonable baseline strategy. However, it fails to account for the context dependency of thematic roles' syntactic realization. The next step is to construct a pairwise preference matrix: for each pair of roles encountered in training data we calculate the proportion of times role $r_i$ receives a higher syntactic rank than role $r_j$. For our role set this results in the matrix shown in Fig. 1.
The preference matrix, for example, shows that Agent clearly dominates all the roles, Instrument ranks over Theme, and Value is below Theme.
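
A minimal sketch of the local and pairwise statistics, assuming the syntactic hierarchy of the running example encoded as integers (smaller = more prominent). The function names and the three toy predications are hypothetical and only serve to show how the two rankers differ.

```python
from collections import defaultdict
from itertools import combinations

# Syntactic hierarchy of the running example (smaller = more prominent).
SYN_RANK = {"subj": 0, "iobj": 1, "nmod": 2, "obj": 3, "other": 4}


def syn(relation):
    """Rank of a relation; unknown relations fall back to 'other'."""
    return SYN_RANK.get(relation, SYN_RANK["other"])


def local_ranks(predications):
    """Local ranker: average syntactic rank observed for each role."""
    sums, counts = defaultdict(float), defaultdict(int)
    for args in predications:
        for role, rel in args:
            sums[role] += syn(rel)
            counts[role] += 1
    return {role: sums[role] / counts[role] for role in sums}


def pairwise_preferences(predications):
    """pref[(ri, rj)]: share of co-occurrences in which ri outranks rj."""
    above, seen = defaultdict(int), defaultdict(int)
    for args in predications:
        for (r1, d1), (r2, d2) in combinations(args, 2):
            if r1 == r2:
                continue
            seen[(r1, r2)] += 1
            seen[(r2, r1)] += 1
            if syn(d1) < syn(d2):
                above[(r1, r2)] += 1
            elif syn(d2) < syn(d1):
                above[(r2, r1)] += 1
    return {pair: above[pair] / n for pair, n in seen.items()}


# Hypothetical toy predications as lists of (role, relation) tuples.
data = [
    [("Ag", "subj"), ("Th", "obj"), ("In", "nmod")],
    [("In", "subj"), ("Th", "obj")],
    [("Th", "subj"), ("Va", "nmod")],
]
local = local_ranks(data)
pref = pairwise_preferences(data)
```

In this toy sample the local ranker assigns Theme and Value the same average rank (2.0), whereas the pairwise statistics record that Theme outranks Value whenever the two co-occur. Instrument and Value never co-occur, so the pair has no entry at all, which is the gap the global ranker is meant to close.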

Global ranker
The pairwise ranking approach takes context into account. However, some role pairs only co-occur rarely. In such cases no pairwise ranking information is available to the model. Finding a global TH based on pairwise preferences is an example of a rank aggregation problem which can be solved via constrained ILP optimization on a preference graph (Conitzer et al., 2006). We represent the pairwise preference matrix as a graph $G = (v, e)$ where each vertex $v$ represents a role and the edge weight is the preference strength measured as $\#(r_i \prec r_j) / \#(r_i, r_j)$. The edge direction is from the higher- to the lower-ranking role. If we assume a global ordering of the roles, we can induce the global ranking via transitivity relations. For example (Fig. 2), Instrument never appears with Value in our training data; however, by transitivity via Theme we can assume that Instrument ranks over Value.
Given the preference graph $G = (v, e)$, let $w_{ij}$ be the weight of the edge between $v_i$ and $v_j$. Let $x_{ij} \in \{0, 1\}$ denote that we rank vertex $v_i$ above $v_j$.
The goal is then to maximize $\sum_{i,j} x_{ij} w_{ij}$ subject to two groups of constraints. First, we prohibit two nodes from ranking above each other, but allow ties, by enforcing $\forall i,j: x_{ij} + x_{ji} \leq 1$. Second, we enforce transitivity, i.e. if $r_i$ is ranked above $r_j$, and $r_j$ is ranked above $r_k$, then $r_i$ must be ranked above $r_k$; formally, $\forall i,j,k,\ i \neq j \neq k: x_{ij} + x_{jk} - x_{ik} \leq 1$. We solve the ILP problem using the off-the-shelf PuLP optimizer (Mitchell et al., 2011).
For our restricted example, optimization produces the following global hierarchy: Ag ≺ In ≺ Th ≺ Va/Pa. This hierarchy ranks Instrument above Value by transitivity; however, in the case of Patient and Value no preference can be inferred from the graph, so they receive the same thematic rank.
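
For a role set as small as the running example, the same objective can be illustrated without an ILP solver. The sketch below replaces the solver with exhaustive search over rank-level assignments, which satisfy both constraint groups by construction (two roles cannot rank above each other, and "above" is transitive). The weights in `W` are hypothetical and are not the values of Fig. 1; breaking ties between equally good solutions in favour of fewer ordered pairs is an assumption that leaves unsupported pairs tied.

```python
from itertools import product

ROLES = ["Ag", "In", "Th", "Pa", "Va"]

# Hypothetical preference weights w[(ri, rj)] = #(ri above rj) / #(ri, rj).
# Pairs that never co-occur (In/Va and Pa/Va) are simply absent.
W = {
    ("Ag", "In"): 1.0,
    ("Ag", "Th"): 0.95, ("Th", "Ag"): 0.05,
    ("Ag", "Pa"): 0.98, ("Pa", "Ag"): 0.02,
    ("Ag", "Va"): 1.0,
    ("In", "Th"): 0.8, ("Th", "In"): 0.2,
    ("In", "Pa"): 0.7, ("Pa", "In"): 0.3,
    ("Th", "Pa"): 0.85, ("Pa", "Th"): 0.15,
    ("Th", "Va"): 0.9, ("Va", "Th"): 0.1,
}


def induce_hierarchy(roles, weights):
    """Exhaustive search for the ranking that maximizes sum(x_ij * w_ij)."""
    best_key, best_level = None, None
    for levels in product(range(len(roles)), repeat=len(roles)):
        level = dict(zip(roles, levels))
        # x_ij = 1 exactly for the pairs where role i sits on a higher tier.
        ordered = [(a, b) for a in roles for b in roles if level[a] < level[b]]
        objective = sum(weights.get(pair, 0.0) for pair in ordered)
        # Assumption: among equally good solutions prefer fewer ordered pairs.
        key = (round(objective, 9), -len(ordered))
        if best_key is None or key > best_key:
            best_key, best_level = key, level
    tiers = {}
    for role, lvl in best_level.items():
        tiers.setdefault(lvl, []).append(role)
    return [sorted(tiers[lvl]) for lvl in sorted(tiers)]


hierarchy = induce_hierarchy(ROLES, W)
rendered = " < ".join("/".join(tier) for tier in hierarchy)
```

With these toy weights the search places Instrument above Value through Theme although the two never co-occur, and leaves Patient and Value on the same tier. Exhaustive search grows exponentially with the number of roles, so it is only practical for illustration; a full role inventory calls for an ILP solver as described above.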

Datasets and Restrictions
For our experiments on English, we use SemLink (Bonial et al., 2013), a manually constructed resource that enriches PropBank's (Palmer et al., 2005) semantic role annotations with VerbNet's (Schuler, 2006) thematic role labels. We use the Universal Dependencies converter (Schuster and Manning, 2016) to transform the original PropBank syntactic annotation to UD. PropBank semantic role annotation and the corresponding SemLink reference are constituent-based. However, UD is a dependency formalism, and we employ a number of heuristics to align the original PropBank annotations with the CoNLL-2009 datasets (Hajič et al., 2009) to recover the head node positions. We employ additional transformations, filtering out the predications in which not all PropBank core roles got aligned to the VerbNet thematic roles.
For German, we use the recently introduced SR3de dataset (Mújdricza-Maydt et al., 2016; Hartmann et al., 2017) which explicitly provides VerbNet annotations on top of the SALSA corpus (Burchardt et al., 2006). There exist no gold UD annotations for the SALSA corpus, and we use SALSA's default TIGER syntactic formalism (Dipper et al., 2001) in our experiments.
Following previous work, we employ certain restrictions on our data. Since thematic roles in both VerbNet and SR3de are only defined for verbal predicates, we restrict the scope of our study to verbs. We only consider direct dependents of the verbs in active voice, and since having access to the full argument set is important to study con-   Table 2. In all experiments we induce a TH and related statistics from the training data and evaluate it on the test data, using the split from the CoNLL SRL shared tasks.

Syntactic ranker
For simplicity in this paper we only experiment with two syntactic rankers per language. A common syntactic prominence scale assumed in linguistic literature is subject ≺ object ≺ indirect object ≺ oblique. This scale has to be adapted to the UD and TIGER labeling schemes. For each language we evaluate two syntactic rankings: one that positions objects above indirect objects and obliques, and one that positions objects below.
For English, we rank the UD syntactic relations as follows (SE1): nsubj / csubj ≺ iobj ≺ nmod ≺ ccomp / dobj ≺ other; where nmod corresponds to oblique and other is used for any other syntactic relation. An alternative ranking (SE2) positions dobj directly after the subject.
For German, the following ranking of TIGER syntactic relations is employed (SD1): SB ≺ DA ≺ OP / MO / OG / OC ≺ OA / OA2 / CVC ≺ other; where SB is the subject, DA is the dative object, OP / MO / OG / OC correspond to oblique relations, and OA / OA2 / CVC to direct object relations (see Dipper et al., 2001, for a detailed description). Similarly, we evaluate the performance of the ranking that positions the direct object after the subject (SD2): SB ≺ OA / OA2 / CVC ≺ DA ≺ OP / MO / OG / OC ≺ other.

Table 3: Thematic ranker evaluation, incl. random ranker (RND) and upper bound (UB); bold - best result over syntactic rankers, underlined - best result over thematic rankers
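
The syntactic rankers are plain lookup tables over dependency labels, which can be sketched as below. The helper `make_ranker` is hypothetical. SE1, SD1 and SD2 follow the rankings spelled out in the text; the tier order used for SE2 is an assumption built by analogy with SD2, since only its defining property (dobj directly after the subject) is stated.

```python
def make_ranker(tiers):
    """Turn an ordered list of label tiers into a relation -> rank function."""
    table = {label: rank for rank, tier in enumerate(tiers) for label in tier}
    other = len(tiers)  # every unlisted relation falls into the final tier
    return lambda relation: table.get(relation, other)


# English, UD labels.
SE1 = make_ranker([["nsubj", "csubj"], ["iobj"], ["nmod"], ["ccomp", "dobj"]])
# Assumption: SE2 mirrors SD2, with the object tier moved up to second place.
SE2 = make_ranker([["nsubj", "csubj"], ["ccomp", "dobj"], ["iobj"], ["nmod"]])

# German, TIGER labels.
SD1 = make_ranker([["SB"], ["DA"], ["OP", "MO", "OG", "OC"], ["OA", "OA2", "CVC"]])
SD2 = make_ranker([["SB"], ["OA", "OA2", "CVC"], ["DA"], ["OP", "MO", "OG", "OC"]])
```

Labels in the same tier receive the same rank, so the rankers produce ties by design, which the linking model explicitly allows.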

Bounds
We construct the upper bound for the hierarchy induction by evaluating a global ranker trained on the test dataset. The upper bound reflects the data properties, as well as the maximal alignment accuracy that can be achieved with the selected syntactic ranker. The lower bound is constructed by evaluating 100 random thematic rankers which rank roles according to a random (but consistent) hierarchy, and averaging the result.
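
The lower bound can be sketched as follows. The sketch assumes that a random but consistent hierarchy is a random strict ordering of the roles without ties; the scoring function `toy_score` and the list `OBSERVED` are hypothetical stand-ins for the actual evaluation on test data.

```python
import random


def random_hierarchy(roles, rng):
    """A random but consistent hierarchy: a random total order of the roles."""
    order = list(roles)
    rng.shuffle(order)
    return {role: rank for rank, role in enumerate(order)}


def lower_bound(roles, evaluate, n_rankers=100, seed=0):
    """Mean score of n_rankers random hierarchies; evaluate(ranks) -> float."""
    rng = random.Random(seed)
    scores = [evaluate(random_hierarchy(roles, rng)) for _ in range(n_rankers)]
    return sum(scores) / len(scores)


# Toy scorer (hypothetical): share of observed "a outranks b" pairs
# that a given hierarchy reproduces.
OBSERVED = [("Ag", "In"), ("Ag", "Th"), ("In", "Th")]


def toy_score(rank):
    return sum(rank[a] < rank[b] for a, b in OBSERVED) / len(OBSERVED)


baseline = lower_bound(["Ag", "In", "Th"], toy_score)
```

Fixing the seed keeps the baseline reproducible across runs; with the toy scorer the average lands close to chance level, since a random order reproduces each observed preference about half of the time.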

Data utilization setup
To evaluate how effectively the proposed rankers use the training data, we conduct a series of experiments with reduced dataset sizes using the following protocol. The training dataset is shuffled and split into n = 100 slices. A ranker is consecutively trained on the first m ∈ 1..n slices and evaluated against the full test dataset. The procedure is repeated k = 100 times to eliminate the effect of data order, and the results per slice are averaged.
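
The protocol above can be sketched as a generic learning-curve routine. The callables `fit` and `score` are placeholders for training and evaluating a ranker; the slice boundaries computed with `round` and the toy "model" used in the usage lines are assumptions made for illustration.

```python
import random


def learning_curve(train, test, fit, score, n_slices=100, n_repeats=100, seed=0):
    """Average test score after training on the first m of n_slices slices."""
    rng = random.Random(seed)
    totals = [0.0] * n_slices
    for _ in range(n_repeats):
        data = list(train)
        rng.shuffle(data)  # a new data order for every repetition
        for m in range(1, n_slices + 1):
            end = round(len(data) * m / n_slices)  # end of the m-th slice
            model = fit(data[:end])
            totals[m - 1] += score(model, test)
    return [total / n_repeats for total in totals]


# Toy usage (hypothetical): the "model" is the set of items seen so far and
# the score is the share of test items that the model has already observed.
items = list(range(20))
curve = learning_curve(
    items,
    items,
    fit=set,
    score=lambda model, test: sum(x in model for x in test) / len(test),
    n_slices=4,
    n_repeats=3,
)
```

Retraining from scratch on every prefix keeps the sketch simple; a ranker that only accumulates counts could instead be updated slice by slice.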

General Accuracy and Syntactic Ranker
To get an overall impression of the ranking quality, we first compare the performance of thematic rankers with respect to syntactic rankers and available datasets. The results of this comparison are summarized in Table 3 and show that syntactic rankers positioning the object second in the hierarchy (SE2 and SD2) lead to better alignment on both datasets and have a higher upper bound. We report the results on these rankers for the rest of the paper. For English the global hierarchy-based ranker approaches the upper bound, closely followed by the pairwise ranker. The accuracy on German data is lower and the pairwise and local rankers outperform the global hierarchy-based ranker. We revisit this observation in 6.5.

EN Agent ≺ Cause/Instrument/Experiencer ≺ Pivot ≺
Table 5: Cross-lingual evaluation, global ranker

Qualitative analysis
The result of hierarchy induction is a global ranking of thematic roles. Table 4 shows full rankings extracted for English and German data. While some correspondence to the hierarchies proposed in the literature is evident (e.g. for English Agent ≺ Instrument ≺ Theme, similar to Fillmore (1968)), a direct comparison is impossible due to the differences in role definitions and underlying syntactic formalisms. Notice the high number of ties: some roles never co-occur (either by chance or by design) or occur on the same syntactic rank (e.g. oblique), so there is no evidence for preference even if we enforce transitivity.

Cross-lingual hierarchy induction
The induced hierarchies for English and German bear certain similarities, which raises the question of the cross-lingual applicability of the hierarchies. This analysis is only possible because the VerbNet and SR3de role inventories are mostly compatible, with few exceptions (Mújdricza-Maydt et al., 2016). Table 5 contrasts the performance of THs induced from English and German training data, and evaluated on German and English test data respectively. While the cross-lingual performance is expectedly lower than the monolingual performance, it outperforms the random baseline by a large margin, suggesting the potential for cross-lingual hierarchy induction.

Data utilization
One can assume that constructing a global hierarchy should require less training data due to the effective utilisation of transitivity. We evaluate this assumption empirically. Fig. 3 reports the performance of rankers with access to different amounts of training data for English and German. The results on English data show that the global hierarchy-based ranker effectively utilizes the training data and can be trained using just fractions of the original training dataset. The accuracy measurements on German are less conclusive: the local ranker generally performs best and learns fastest. We attribute this to the fact that filtered SR3de is an order of magnitude smaller than the PropBank/SemLink dataset. For pairwise and global rankers as many role pairs as possible should be observed at least once to establish the pairwise preference. This holds for PropBank/SemLink (all role pairs from test data seen at least once after observing 20% of the training data, on average); however, for filtered SR3de, even given the full training data, only 83% of role pairs from the test set have been seen at least once.

Table 6: Global ranker accuracy, English

Error analysis
Our evaluation procedure allows detailed insights into the performance of the models. To illustrate, we extract the role pairs from English and German data with ranking accuracy below 1.0. Error analysis on the much smaller German dataset (Table 7) reveals sparsity-related issues: most of the role pairs that tend to get misaligned appear in the training data only rarely or not at all, which heavily influences the score. As on English data, many misalignments are due to the simplicity of the syntactic ranker.

Importance of the syntactic ranker
The choice of syntactic ranking has a drastic effect on the resulting TH and the alignment quality, even if only direct syntactic dependents and a limited set of relations are taken into account. Realistically there might exist an arbitrary set of paths connecting arguments to predicates. UD as syntactic formalism is also subject to rapid change. Inducing a joint syntactic and thematic hierarchy that maximizes the overall alignment quality is a crucial direction for future work with potential benefits for SRL and syntactic parsing. Although we show that THs can be induced with an arbitrary dependency formalism, a cross-lingual UD-based study would be another extension to our work.

SRL integration
To utilize and evaluate the potential of thematic hierarchies for role interaction modeling, SRL integration is necessary. This, however, is not a trivial task: the absolute majority of semantic role labeling systems are designed with PropBank or FrameNet SRL formalism in mind and are not tailored to general VerbNet-style semantic roles and verb class-level disambiguation. A dedicated VerbNet SRL system would enable this assessment, and applying THs to such a system is an important future work direction.

Robustness to parsing errors
This paper focuses on TH induction using predefined syntactic annotation: a corpus annotated with semantic roles without an underlying syntactic layer is a rare occurrence. However, for practical applications and for the cases when an SRL corpus is provided without syntactic annotations, it would be important to evaluate how effectively THs can be induced given parsing errors in training and in test data.

Data selection
We have demonstrated that THs can be induced from small portions of training data. The large discrepancy in the scores on the first data slices seen in Fig. 3 suggests that some data instances are more informative for TH induction. This raises the question of whether it is possible to automatically select useful training instances, an idea supported by evidence from previous work in SRL (Peterson et al., 2014). One obvious strategy would be to make sure that the hierarchy inducer is presented with as many role pairs as possible as early as possible. Approximating this objective in an unsupervised way would reduce the amount of data needed to induce a high-quality thematic hierarchy.

The need for a global hierarchy
Our results regarding the necessity of a global hierarchy which ranks all the roles are inconclusive. While global ranking reaches the best quality for English, on the German data pairwise and local ranking approaches perform best. Although we attribute the latter to sparsity, more German data would be needed to evaluate this hypothesis. In particular, this can be achieved by relaxing some of the constraints we impose on the data.

Conclusion
This paper has presented an empirical framework for thematic hierarchy induction and evaluation. We have suggested several syntactic and thematic ranking strategies and a method to induce global thematic hierarchies from corpus data. Analysis on English and German data shows that hierarchy induction is feasible, data-efficient and has potential for cross-lingual applications. Promising directions for future work include joint modeling of syntactic and thematic ranking, selecting informative training instances and evaluating the utility of global hierarchies on extended language material.