When Beards Start Shaving Men: A Subject-object Resolution Test Suite for Morpho-syntactic and Semantic Model Introspection

In this paper, we introduce the SORTS Subject-Object Resolution Test Suite of German minimal sentence pairs for model introspection. The full test suite consists of 18,502 transitive clauses with manual annotations of 8 word order patterns, 5 morphological and syntactic property classes, and 11 semantic property classes. The test suite has been constructed such that sentences are minimal pairs with respect to a property class. Each property has been selected with a particular focus on its effect on subject-object resolution, the second-most error-prone task within syntactic parsing of German after prepositional phrase attachment (Fischer et al., 2019). The size and detail of annotations make the test suite a valuable resource for natural language processing applications with syntactic and semantic tasks. We use dependency parsing to demonstrate how the test suite allows insights into the process of subject-object resolution. Based on the test suite annotations, word order and case syncretism can be identified as the most important factors that affect subject-object resolution.


Introduction
Subject-object resolution remains a difficult task for syntactic parsing of languages with relatively free word order and case syncretism. For such languages, subject and object cannot always be disambiguated based on morpho-syntactic surface structures (Lenerz, 1977b; Eisenberg, 2013). Parser performance on German and Dutch, for example, has been shown to suffer significantly from incorrect identification of subject and object (Van Noord, 2007; Fischer et al., 2019). While there have been task-specific test suites for difficult syntactic phenomena in German such as prepositional phrase attachment, coordination, and verb phrase complementation (Nerbonne et al., 1991; Lehmann et al., 1996; Kübler et al., 2009), subject-object resolution has widely been neglected in existing test suites.
In this paper, we introduce the SORTS Subject-Object Resolution Test Suite of German minimal sentence pairs. The test suite has been created manually from hand-selected subject-verb-object triples and contains 18,502 transitive clauses with Universal Dependency (UD, De Marneffe et al. (2014)) annotations, template-based annotations of 8 word order patterns and manual annotations of 16 morphological, syntactic and semantic property classes. For instance, the sentence Sie trifft eine Entscheidung. 'She makes a decision.' has been annotated with word order VF[S]LK[V]MF[O], the syntactic property subject pronoun and the semantic property light verb in addition to dependency relations relevant for subject-object resolution, as illustrated in Figure 1.
One domain of application for this test suite is syntactic parsing. We will use six neural dependency parsers with different architectures to show that the test suite is able to expose the linguistic properties that make subject-object resolution easier or more difficult for parsing. Experimental results imply that parsers are not able to resolve certain syntactic structures well, e.g. subject-object pairs with object-subject order and case syncretism.
The paper is structured as follows: Section 2 gives an overview of subject-object resolution in German as background to the test suite description in Section 3. Section 4 shows an application of the test suite in German dependency parsing. Related work is presented in Section 5 before concluding with some guiding remarks in Section 6.
1 The test suite is available online at https://github.com/DiveFish/SORTS and will continuously be extended. It is provided in CoNLL and sentence-based format, cf. Appendix A for samples from the test suite.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
Figure 1: Dependency relations relevant for subject-object resolution in Sie trifft eine Entscheidung. 'She makes a decision.'

Subject-object Orders in German
In German, the order and position of subject and object are relatively flexible within a clause. As shown in Example 1 (adapted from TüBa-D/Z UD, sentence #35072), subject and object can be swapped without changing the meaning of the sentence (Lenerz, 1977b). Subjects and objects can be identified based on morphological case marking for masculine singular nouns, determiners and the majority of pronouns. In the remaining noun and determiner paradigms, nominative and accusative forms overlap. This case syncretism can result in ambiguities between subjects and direct objects as in Example 2. Here, both das Ergebnis and Sie can be nominative or accusative, in contrast to Example 1 where euch can only be accusative. Subject-verb agreement can resolve subject and object if subject and object differ in morphological marking for number.
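The interplay of case marking and subject-verb agreement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Argument`, `subject_candidates` and `is_ambiguous` are hypothetical, the feature sets are written by hand, and the noun phrases are illustrative and not taken from the test suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    form: str
    cases: frozenset  # cases the surface form is compatible with
    number: str       # "sg" or "pl"


def subject_candidates(args, verb_number):
    """Arguments that can be nominative and agree with the finite verb in number."""
    return [a for a in args if "nom" in a.cases and a.number == verb_number]


def is_ambiguous(args, verb_number):
    """Morpho-syntactic cues fail if more than one argument qualifies as subject."""
    return len(subject_candidates(args, verb_number)) > 1


# Hand-written, illustrative feature sets (assumptions, not test suite data).
frau = Argument("die Frau", frozenset({"nom", "acc"}), "sg")
kind = Argument("das Kind", frozenset({"nom", "acc"}), "sg")
mann = Argument("den Mann", frozenset({"acc"}), "sg")
kinder = Argument("die Kinder", frozenset({"nom", "acc"}), "pl")

syncretic = is_ambiguous([frau, kind], "sg")       # case syncretism on both arguments
case_marked = is_ambiguous([frau, mann], "sg")     # "den Mann" can only be accusative
by_agreement = is_ambiguous([frau, kinder], "pl")  # plural verb selects "die Kinder"
```

In this sketch only the first pair stays ambiguous; the second is resolved by case marking and the third by number agreement with the verb.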
German word order preferences such as the positions of subject and object have been described with the topological field model (Drach, 1937) that structures clauses in terms of fields: The middlefield (MF) is enclosed by the verbal bracket to the left (LK) and to the right (VC); the LK is preceded by the forefield (VF) in assertion main clauses. The verb phrase is located in the LK and VC, subject and object in the VF or MF. The order of subject and object within the MF is relatively free. Table 1 lists examples of all subject-object orders and clause types specified with respect to topological fields.
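Word order annotations in this notation, such as VF[S]LK[V]MF[O], can be decomposed mechanically. The following sketch assumes that every pattern is a sequence of field names, each followed by its content in square brackets, with S, O and V as the only content symbols; the function names are hypothetical, and patterns other than VF[S]LK[V]MF[O] are assumed to follow the same notation.

```python
import re

# A field name (e.g. VF, LK, MF, VC) followed by its bracketed content.
FIELD_RE = re.compile(r"([A-Z]+)\[([SOV]+)\]")


def parse_pattern(pattern):
    """Split a word order pattern into (field, content) pairs."""
    return FIELD_RE.findall(pattern.replace(" ", ""))


def subject_precedes_object(pattern):
    """True if S occurs before O in the linearised pattern."""
    order = "".join(content for _, content in parse_pattern(pattern))
    return order.index("S") < order.index("O")


fields = parse_pattern("VF[S]LK[V]MF[O]")
subject_first = subject_precedes_object("VF[S]LK[V]MF[O]")
object_first = subject_precedes_object("VF[O]LK[V]MF[S]")
```

Grouping sentences by the result of `subject_precedes_object` gives the split between subject-object and object-subject orders that the evaluation relies on.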

Subject-object Resolution in Syntactic Parsing
Although subjects and objects can occur in any of the described positions within a German clause, a preference for subjects to precede objects can be observed. The TüBa-D/Z treebank of German newspaper text from the Berliner Tageszeitung taz with UD annotations (Telljohann et al., 2017; Çöltekin et al., 2017) includes 69,462 clauses with subject, verb and at least one object. As Table 2 shows, the subject precedes the object in 81.31 percent of all clauses in the TüBa-D/Z UD. The distribution of subject-object orders in corpora is also reflected by attachment scores of syntactic parsers trained on such corpora. Table 2 shows relative frequencies and labeled attachment score (LAS) per order of subject, verb and object in the TüBa-D/Z UD test set (12,657 samples). Attachment scores are based on the De Kok and Hinrichs (2016) parser, from here on "baseline parser". The baseline parser has been trained on 70 percent of the TüBa-D/Z UD with overall LAS 89.89. For the TüBa-D/Z UD test set, subject-object order frequency and LAS of the baseline parser positively correlate (Pearson correlation coefficient ρ = 0.58). More frequent word orders are parsed correctly more often. Consequently, subjects and objects are expected to be identified with lower accuracy for the less frequent object-subject order.

For each of the 8 patterns included in Table 3, we settle on a set of base sentences that we describe next. To each base sentence, we apply up to two of the morphological, syntactic, or semantic variations described in Sections 3.2 and 3.3. We take as the base case the subjects, objects and verbs which are found most frequently in the TüBa-D/Z UD and parsed most accurately by the baseline parser. The properties of these subjects, objects and verbs are the following:
• Subject and object noun phrases consist of a common noun preceded by a determiner. Table 4 shows that definite subjects occur most often with indefinite objects and are the easiest to parse of all definiteness combinations of subjects and objects. In order to avoid introducing biases of the linguistic expert creating the test suite sentences (first author of the paper), subjects, verbs and objects have been partially selected from frequent phrases in the TüBa-D/Z UD. Subjects and objects do not display case syncretism.
Table 4: Relative frequency and subject-object LAS of definite and indefinite subjects and objects in the TüBa-D/Z UD test set. Definite subjects before indefinite objects are the most frequent subject-object combination with the highest LAS.

• Subjects have been restricted to be animate whereas objects are inanimate. Dowty (1991) describes the agent as prototypical subject of a clause. According to Eisenberg (2013), the degree of agency also decreases from subject to object. Since animacy is considered to correlate with agency, the subject has been defined to be animate and the object to be inanimate. Psych verbs with experiencer objects are one exception, as in Das Auto gefällt ihnen. 'The car pleases them.', where agency increases from subject to object (Lenerz, 1977b; Bader and Häussler, 2009).
• Unstressed object pronouns preferably occur at the left edge of the MF (Wackernagel position, Eisenberg (2013)). This affects the markedness of different subject-object orders: Sentences with a pronominal object before a nominal (i.e. non-pronominal) subject are less marked than sentences with a nominal object before a nominal subject. To prevent different degrees of markedness from influencing subject-object resolution, subject and object have been defined not to be pronouns.
• Verbs are in the active voice and do not include separable particles, modals or auxiliaries. The present tense has been used in all sentences. Furthermore, the chosen verbs take accusative objects as their argument. Monotransitive verbs with accusative objects are far more frequent than monotransitive verbs with dative objects (88.61 percent accusative compared to 8.50 percent dative objects in TüBa-D/Z UD). Accusative objects therefore allow a greater variety of verbs. Ditransitive verbs have been excluded because additional arguments would introduce new sets of word order preferences, which would in turn make a focused inspection of the factors that affect subject-object resolution more difficult.
Subsections 3.2 and 3.3 introduce morphological, syntactic, and semantic properties that serve as the source of variation from the base case that is described in the present subsection.

Morphological and Syntactic Variations
Object case. The SORTS test suite contains monotransitive clauses with accusative and with dative objects. A comparison of clauses with accusative and dative objects can shed light on the effect of case on subject-object resolution.
Case syncretism. Case marking in German is not always unique. If both subject and object cannot be clearly identified based on their case and number markings, other cues to resolve subject and object need to be used. By enforcing case syncretism in parts of the test suite, it becomes possible to investigate how NLP systems deal with case syncretism. In combination with other selected properties, it can also be tested which properties make subject-object resolution for ambiguous subject-object sequences easier.
Pronominalization of subject and object. Pronouns have an effect on the preferred order of subject and object. In verb-last clauses with pronominal object, the object preferably occurs before a non-pronominal subject. Subject-object resolution should thus be easier for these clauses compared to other clauses where the object precedes the subject. If both subject and object are pronouns, the object cannot occur before the subject in verb-last clauses. Such sentences are considered ungrammatical and are therefore not part of the test suite.
Object negation. Results from experiments with the baseline parser showed that negated objects such as keine Zeitung 'no newspaper' are parsed correctly more often (LAS 95.19) than non-negated objects such as eine Zeitung 'a newspaper' (LAS 83.10). A variant with negated objects allows more focused investigations of the effect of negation on subject-object resolution.
Verb-argument distance: main verb position. Selectional restrictions originate from the main verb which selects its arguments such as the subject and the object. Making the verb phrase more complex with the help of an auxiliary verb changes the position of the main verb, increasing the distance between the main verb and its arguments. The auxiliary verb werden 'will' was used because it is not restricted to a particular class of verbs, in contrast to the auxiliary haben 'to have' or modal verbs such as können 'can/be able to'. Thus, werden serves as a means to test whether subject-object resolution becomes more difficult if the main verb and its arguments are further apart.
Verb-argument distance: additional constituents. As a second means to increase the distance between the main verb and its arguments, the temporal prepositional phrase (PP) in dem Jahr 'in that year', the most frequent PP in the TüBa-D/Z UD, has been added in one sentence variant. The PP is inserted after the verb, if possible, as shown in Example 3. This avoids introducing further ambiguities such as PP attachment ambiguities.

Semantic Variations
Animacy. With respect to the semantic property of animacy, four co-occurrence patterns of subject and object have been included in the SORTS test suite: animate subject, inanimate object (base case pattern: no variation); inanimate subject, inanimate object (with deviation from the base case pattern: inanimate subject); animate subject, animate object (with deviation from the base case pattern: animate object); inanimate subject, animate object (with deviation from the base case pattern for both subject and object). Suitable phrases were manually created, humans and other animal species being considered animate. The decrease in agency from subject to object does not hold for psych verbs with experiencer objects such as to amuse, to please or to frighten where the object is typically more agentive than the subject (Lenerz, 1977b; Temme, 2018). For this reason, objects of psych verbs are animate, subjects inanimate, and psych verbs do not exhibit any of the variations for animacy described in the previous paragraph.
Regular polysemy. Languages tend to include nouns that exhibit regular patterns of polysemy (Apresjan, 1975). Nouns such as university and company can either refer to a set of people (in this reading: exhibiting the property of animacy) or a set of buildings (in this reading: lacking the animacy property). In order to determine whether a given occurrence of such nouns is more likely to be the subject or the object, we have to resolve the correct word sense of the noun. Instances of regular polysemy have been selected from the lexical-semantic word net GermaNet (Hamp and Feldweg, 1997; Henrich and Hinrichs, 2010). In all sentences, the animate readings have been used as subjects, since animate noun phrases are more agentive than inanimate noun phrases (cf. Section 3.1).
Proper name subjects. While common nouns are marked for case and number, proper nouns are marked for genitive case only. In addition, they are not accompanied by a determiner that could carry morphological information. In sum, proper nouns do not provide information about nominative or accusative and were therefore selected as more difficult cases of subject-object resolution.
Semantic asymmetry. Verbs can express actions that can only be executed between specific subjects and specific objects. The roles of subject and object cannot be swapped for these subject-verb-object combinations. One example is to teach with subject-object pairs such as professor-student. While not impossible, it is rare that the student teaches the professor. Other examples are to command, to arrest etc. The correct subject-object resolution is particularly challenging for such verbs since the correct asymmetric relation between the participants has to be established. Due to the lack of existing resources, examples of subject-object asymmetry have been manually created.
Non-referential objects. Non-referential objects such as inanimate nichts 'nothing' and animate niemanden 'nobody' make it possible to study the effects of object animacy and inanimacy on subject-object resolution, excluding other factors that may be due to the descriptive content of the head noun of a noun phrase.
Light verb constructions. Light verb constructions differ from other verb-object pairs in that the meaning is mostly derived from the object and the "light" verb contributes only little to the meaning of the phrase (Eisenberg, 2013). Examples are eine Entscheidung treffen 'to make a decision' or eine Frage stellen 'to pose a question'. Since the verb and the object form a unit of meaning, changes in word order should have less of an effect on subject-object resolution than with more loosely related verb-object pairs. Light verbs have been picked from the most frequent verb-object combinations in the TüBa-D/Z UD.
Verb synonyms. The meaning of a sentence as a whole is not affected by a replacement of the main verb by one of its synonyms. The same holds for the syntactic analysis of the sentence and its synonymous counterpart. If the syntactic analyses differ, the verb synonyms have erroneously been treated as two distinct, non-related verbs. Synonyms have been retrieved manually from Duden (2019) and an online thesaurus by a German native speaker, aiming for minimal semantic differences between the original verb and its synonym.
Idioms. In idioms, subject, object and verb form one unit of meaning (Duden, 2019). The variation of idioms into all different word orders allows insights into the importance of the syntactic structure of idioms. Idioms have been manually selected from a list of German idioms, including only idioms which consist of subject, verb and object without any additional phrases such as relative or coordinate clauses, adjuncts etc.

Degrees of Variation
Sentences with up to two variations from the base sentences, which were described in Section 3.1 above, were manually created for the SORTS test suite. Table 5 shows the different degrees of variation for an example sentence. These different degrees of variational depth make it possible to test if certain properties or property combinations make subject-object resolution easier or more difficult.

Variation                           Example                                        Property
0: No variation                     Der Leser abonniert eine Zeitschrift.          base
                                    'the reader subscribes to a newspaper'
1: Auxiliary verb                   Der Leser wird eine Zeitschrift abonnieren.    aux
                                    'the reader will subscribe to a newspaper'
2: Auxiliary verb, synonymous verb  Der Leser wird eine Zeitschrift bestellen.     aux-syn
                                    'the reader will order a newspaper'
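The property labels base, aux and aux-syn encode the degree of variation directly, which makes it easy to bucket sentences by variational depth. The sketch below assumes that combined properties are joined by a hyphen and that individual property names contain no hyphen themselves; the function names are hypothetical and not part of the released test suite.

```python
from collections import defaultdict


def variation_degree(label):
    """Number of variations encoded in a property label ('base' counts as 0)."""
    if label == "base":
        return 0
    # Assumption: combined properties are joined by '-' and single
    # property names contain no hyphen themselves.
    return len(label.split("-"))


def group_by_degree(labelled_sentences):
    """Map each degree of variation to the sentences annotated with it."""
    groups = defaultdict(list)
    for sentence, label in labelled_sentences:
        groups[variation_degree(label)].append(sentence)
    return dict(groups)


examples = [
    ("Der Leser abonniert eine Zeitschrift.", "base"),
    ("Der Leser wird eine Zeitschrift abonnieren.", "aux"),
    ("Der Leser wird eine Zeitschrift bestellen.", "aux-syn"),
]
groups = group_by_degree(examples)
```

Comparing parser scores across the resulting groups is one way to test whether a second variation makes subject-object resolution harder than a single one.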

Data Subsets
In order to determine the effect of ambiguity between subject and object on subject-object resolution, two data subsets have been created. In test suite SORTS part-amb (partially ambiguous), subject and object can be identified based on morphological information on subject and object, the only exception being one variation in which case syncretism occurs between subject and object (10,839 sentences). In test suite SORTS amb (fully ambiguous), all sentences have been manually changed to be ambiguous (7,663 sentences). In that set, the case syncretism variant has been removed along with the dative object variant for which no subject-object ambiguity occurs.

Parser Models
A neural transition-based dependency parser with a feed-forward neural network of one hidden layer serves as the baseline (De Kok and Hinrichs, 2016). In this Chen-and-Manning-(2014)-style parser, words and parts-of-speech are represented as structured skipgram embeddings (wang2vec, Ling et al. (2015)). In the case of word embeddings, subword units (Bojanowski et al., 2017) are also used. Information about topological fields is provided as one-hot encodings.
The baseline parser has been extended in two ways: (1) normalized pointwise mutual information (PMI) scores from large corpora, which indicate whether and with which label a token should be attached to a candidate head, and (2) similarity scores of dependency embeddings that have been trained to maximize the probability of two tokens occurring in a dependency relation in the training corpus (Fischer et al., 2019).
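To make the first extension more concrete, the sketch below computes a normalized PMI from co-occurrence counts. It uses the common normalization that divides PMI by -log p(x, y), which bounds the score to [-1, 1]; whether the parser uses exactly this variant is an assumption, and the function name and all counts are hypothetical.

```python
import math


def npmi(pair_count, x_count, y_count, total):
    """Normalized PMI of two events from raw counts (assumed normalization).

    pair_count: how often head x and dependent y are attached to each other
    x_count, y_count: how often x and y occur at all
    total: number of observed attachments
    """
    if pair_count == 0:
        return -1.0  # never observed together: lower bound
    p_xy = pair_count / total
    if p_xy == 1.0:
        return 1.0  # avoids division by zero when -log(p_xy) == 0
    p_x = x_count / total
    p_y = y_count / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


# Hypothetical counts, for illustration only.
always_together = npmi(pair_count=10, x_count=10, y_count=10, total=100)
independent = npmi(pair_count=25, x_count=50, y_count=50, total=100)
never_together = npmi(pair_count=0, x_count=50, y_count=50, total=100)
```

A score near 1 suggests that the candidate head and the token attach to each other far more often than chance, a score near 0 suggests independence, and -1 marks pairs that were never observed together.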
The simple feed-forward neural network of the baseline parser provides a very limited view of a token's context. The sticker1 model builds on bidirectional LSTMs (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) which capture the left and right context of a token. The dependency edges are encoded as in Spoustová and Spousta (2010) and Strzyz et al. (2019). Sticker1-self-distilled makes additional use of self-distillation (Hinton et al., 2015; Furlanello et al., 2018). Both sticker1 models use the same word embeddings as the baseline parser. Sticker2 uses XLM-RoBERTa (Conneau et al., 2019) finetuned on the TüBa-D/Z UD corpus for various morpho-syntactic tasks, including dependency parsing. For consistent tokenization between all parsers, sticker2 uses SentencePieces (Kudo and Richardson, 2018) on the token level.

Results and Error Analysis
The six parsers have been tested on the SORTS part-amb and SORTS amb test suites. Due to the focus on subject-object resolution, task-specific attachment scores are reported for subject and object heads and labels, discarding all other attachments.

Table 6 shows subject-object LAS of all parsers for test suites SORTS part-amb and SORTS amb. Results for SORTS part-amb are given including and excluding variants with case syncretism, the latter denoted as SORTS no-amb (no ambiguities). In addition to model LAS, scores for always choosing subject-object order in contrast to object-subject order are provided as Subject-first. Attachment scores decrease as the share of ambiguous sentences in the test suites grows, and so do the improvements over simply choosing the first verbal argument as the subject. Sticker2 performs best on all test suites. It may benefit from having a deeper network (e.g. 12 layers compared to 3 layers in sticker1-self-distilled) and from pretraining on larger amounts of more varied data.

Property-specific results are given for the SORTS amb test suite. The lack of morphological indicators for subject-object identification makes the effects of the linguistic properties from Sections 3.2-3.3 easier to observe than in SORTS part-amb (cf. Appendix C and D for subject-object LAS and baseline improvements for that test suite). Table 7 shows absolute baseline LAS and LAS improvements over the baseline parser for the five non-baseline parsers (cf. Appendix B for absolute LAS of all parsers). Linguistic properties are split into word order, morphological/syntactic and semantic properties. Results per property class represent task-specific subject-object LAS across all sentences to which the property applies. As suggested by the overall results, sticker2 outperforms all other parsers by a wide margin for most of the linguistic properties. For SORTS part-amb, sticker2 achieves the best results on all but one property class with wider margins than for SORTS amb.
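The task-specific score and the Subject-first reference used in this section can be sketched as follows. This is a minimal illustration, not the evaluation script used for the reported numbers: the function names are hypothetical, the target label set (`nsubj`, `obj`) is an assumption about the UD labels involved, and the Subject-first heuristic is assumed to keep gold heads and only reassign the two labels by position.

```python
TARGET_RELATIONS = frozenset({"nsubj", "obj"})  # assumed subject/object labels


def subject_object_las(gold, predicted, targets=TARGET_RELATIONS):
    """LAS restricted to tokens whose gold relation is a subject or object.

    gold, predicted: lists of sentences; a sentence is a list of
    (head_index, relation) pairs, one per token.
    """
    correct = total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        for gold_tok, pred_tok in zip(gold_sent, pred_sent):
            if gold_tok[1] in targets:
                total += 1
                correct += gold_tok == pred_tok  # head and label must both match
    return 100.0 * correct / total if total else 0.0


def subject_first(gold_sent, targets=TARGET_RELATIONS):
    """Label the first verbal argument as subject and later ones as object."""
    output, seen_argument = [], False
    for head, relation in gold_sent:
        if relation in targets:
            output.append((head, "obj" if seen_argument else "nsubj"))
            seen_argument = True
        else:
            output.append((head, relation))
    return output


# Sie trifft eine Entscheidung . / Eine Entscheidung trifft sie . (1-based heads)
so_order = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj"), (2, "punct")]
os_order = [(2, "det"), (3, "obj"), (0, "root"), (3, "nsubj"), (3, "punct")]
gold = [so_order, os_order]
heuristic = [subject_first(sentence) for sentence in gold]
score = subject_object_las(gold, heuristic)
```

On these two sentences the heuristic labels both arguments correctly in the subject-object order and both incorrectly in the object-subject order, which shows why it becomes a strong reference point on data dominated by subject-first clauses.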

In the word order category, more frequent subject-object orders show smaller improvements over the baseline than less frequent object-subject orders. One reason is that absolute LAS is already high for subject-object orders whereas results for object-subject orders range below 40.00 LAS points. As has been shown in Section 2.1, less frequent word orders are the most difficult for subject-object identification, in particular in combination with case syncretism between subject and object. However, these are also the cases where most can be gained from parsers with contextualized word representations that also take advantage of large amounts of training data. Results on the word order property in SORTS part-amb confirm these findings. For MF[SO] orders, the larger number of parameters in the sticker models compared to the single feed-forward layer model of the baseline may explain the improvements over the baseline of sticker1 and sticker2.
Within the morphological and syntactic category, sticker2 improves considerably over the baseline for longer sentences with auxiliary verbs and PPs. On the semantic properties, sticker2 outperforms other models in particular for proper name subjects and highly non-compositional clauses such as idioms. For the latter, more training data seems to yield semantically richer representations of words and phrases, which facilitates subject-object identification across different word orders. Low scores of sticker2 for animate objects may originate from the fact that sticker2 uses different word representations than the other parsers. Sticker2 uses the multilingual XLM-RoBERTa sentence piece vocabulary (Conneau et al., 2019) which consists of 250,000 pieces for more than 100 languages. The other parsers use a monolingual German word embedding vocabulary consisting of 710,288 words. Only one out of 34 animate objects is directly included in the vocabulary of the sticker2 model, which may have negative effects on the performance of sticker2. Similar data sparsity issues may be the reason for mixed results of the baseline parser with PMIs. Absolute subject-object LAS for proper name subjects and idioms confirms that sticker2 improves the most on properties with relatively low baseline performance. Variations in performance between different properties for each individual parser underscore the utility of structuring the test suite according to word order, morphological, syntactic and semantic variations.

Table 7: Property-specific baseline subject-object LAS and baseline improvements on the SORTS amb test suite. The property frequency is included as the number of sentences per property. Columns: Property (frequency), Baseline, Baseline +PMIs, Baseline +embeds, sticker1, sticker1 +self-distill, sticker2.

Related Work
Grammar coverage and correctness. In rule-based parsing, parser performance was dependent on the correctness and coverage of grammar rules and the lexicon. Much work was devoted to testing and ensuring grammar coverage (Burkhardt, 1967; Purdom, 1972; Harrison et al., 1991). Error mining revealed weaknesses in the grammar by categorizing errors into classes that could be covered by adding new rules to the grammar (Van Noord, 2004; Sagot and De la Clergerie, 2006; De Kok et al., 2009).
Another approach to testing grammar coverage and correctness has been the design of test suites, with a noticeable interest in test suites on German syntax: Nerbonne et al. (1991) introduced a catalogue with the major syntactic patterns in German in order to facilitate error detection of NLP systems; Kübler et al. (2009) built a test suite for complex German constructions such as PP attachment, subject gaps and coordination of unlike constituents.
Test suite formats can range from unannotated lists of samples grouped by linguistic phenomena (Flickinger et al., 1987) to sets with detailed annotations such as TSNLP (Test Suites for NLP; Lehmann et al., 1996). Many previous attempts to create test suites were hampered by the diverse landscape of annotation schemes. The intent to make test suites more widely accessible led to efforts such as that of Kübler et al. (2009), who specifically provided test sentences in multiple annotation schemes. Universal Dependencies (De Marneffe et al., 2014) have successfully pushed the development of a language-independent annotation scheme. UD annotations have thus been used in the SORTS test suite. As they are not bound to a language-specific annotation scheme, they make the test suite more applicable for international research in syntactic parsing and other fields where UD is now the de-facto standard annotation.
Probing. Increasing efforts are being made to investigate how linguistic structures are represented in neural networks. Linzen et al. (2016) probed an LSTM architecture's grammatical competence using training objectives with number prediction and grammaticality judgments in English as a target. Shwartz and Dagan (2019) present an evaluation suite consisting of tasks related to lexical composition, such as recognizing light verb and verb-particle constructions. Giulianelli et al. (2018) investigated how neural language models keep track of subject-verb agreement. Minimally-differing sentence pairs as they have been created for the SORTS test suite have also been used in the data sets by Poliak et al. (2018) and Ettinger et al. (2018).

Conclusion and Future Work
We presented the SORTS test suite for model introspection into subject-object resolution via German minimal sentence pairs. With a total size of 18,502 transitive clauses and 24 syntactic-semantic property classes, the test suite is a valuable resource for inspecting syntactic NLP systems of German. Its application in syntactic parsing revealed weaknesses of all parsers when syntactic and morphological cues are insufficient to resolve subjects and objects. Particularly difficult are sentences with object-subject order and case syncretism of subject and object. How to best resolve such sentences will remain an interesting case for future research on parsing.
Another direction for future work focuses on the extension of the test suite to cover more languages and linguistic phenomena. German still provides several cues to subject-object resolution: Case marking and subject-verb agreement. In Dutch, only some remainders of case marking persist in the pronoun paradigm whereas nominal subjects and objects can only be disambiguated morpho-syntactically through subject-verb agreement. For this reason, a second test suite is currently being constructed for Dutch. For ease of comparison, it will to a large extent be a translation of the German test suite.
Subject-object resolution is only the second largest error class behind PP attachment. A second test suite subset will be dealing with different aspects of PP attachment. In contrast to subject-object resolution, PP attachment has the advantage that it applies to a wider range of languages.

The test suite in sentence-based format provides the sentences with each sentence being annotated for the different properties that apply for that sentence. Word order in column 1 is separated from morphological, syntactic and semantic properties in column 2. Subject and object indices for easy subject-object identification are given in columns 3 and 4. This format is particularly suited for architectures which take as input full sentences. Again, the example shows one of the sentences in the property combination of auxiliary verb and light verb construction aux-vlight in all possible word orders.
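A reader for the sentence-based format could look roughly like the sketch below. Only the meaning of columns 1 to 4 is taken from the description above; that the columns are tab-separated, that the sentence itself follows in a fifth column, and that the indices are plain integers are assumptions, as are the class and function names and the sample line. The files in the repository may be laid out differently.

```python
from dataclasses import dataclass


@dataclass
class SortsSentence:
    word_order: str     # column 1, e.g. VF[S]LK[V]MF[O]
    properties: list    # column 2, split into single property names
    subject_index: int  # column 3
    object_index: int   # column 4
    text: str           # assumed fifth column holding the sentence


def parse_line(line):
    """Parse one tab-separated line (assumed layout, see lead-in)."""
    word_order, properties, subj, obj, text = line.rstrip("\n").split("\t", 4)
    return SortsSentence(
        word_order=word_order,
        properties=properties.split("-"),  # assumes '-' joins combined properties
        subject_index=int(subj),
        object_index=int(obj),
        text=text,
    )


# Hypothetical sample line; the values are illustrative, not copied from the data.
sample = "VF[S]LK[V]MF[O]\taux-vlight\t1\t4\tSie wird eine Entscheidung treffen ."
record = parse_line(sample)
```

Keeping word order and property labels as separate fields mirrors the column split described above and lets an evaluation script filter by either dimension independently.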