Suitability of ParTes Test Suite for Parsing Evaluation

Parsers have evolved significantly in recent decades, but substantial improvements are still needed to enhance their performance. ParTes, a test suite in Spanish and Catalan for parsing evaluation, aims to address this situation by pointing to the main factors that can decisively improve parser performance.


Introduction
Parsing has been a very active research area, and parsers have progressed significantly in recent years (Klein and Manning, 2003; Collins and Koo, 2005; Nivre et al., 2006; Ballesteros and Nivre, 2012; Bohnet and Nivre, 2012; Ballesteros and Carreras, 2015). However, further significant improvement in parser performance now requires extra effort.
A deeper and more detailed analysis of parser performance can provide the keys to exceeding the current accuracy. Test suites are a linguistic resource that makes this kind of analysis possible and that can help to highlight the key issues for decisively improving Natural Language Processing (NLP) tools (Flickinger et al., 1987; EAGLES, 1994; Lehmann et al., 1996). This paper presents ParTes 15.02, a test suite of syntactic phenomena for parsing evaluation. This resource contains an exhaustive and representative set of structure and word order phenomena for Spanish and Catalan (Lloberes et al., 2014). The new version adds a development data set and a test data set.
The rest of the paper is organized as follows. Section 2 describes the main contributions in test suite development. Section 3 presents the characteristics and specifications of ParTes. Section 4 discusses the results of an evaluation of the FreeLing Dependency Grammars (FDGs) with verb subcategorization information (Lloberes et al., 2010) using ParTes. Finally, Section 5 presents the main conclusions and future work.

Test suite development
The main aim of qualitative studies is to offer empirical evidence about the richness and precision of the data, whereas quantitative studies provide a view of the actual spectrum (McEnery and Wilson, 1996). For this reason, qualitative analyses are deep and detail-oriented, while quantitative analyses focus on statistically informative data. In the qualitative approach, the representativeness of the studied phenomena rests on exhaustiveness rather than on frequency, which is the basis of the quantitative approach. The two approaches are not mutually exclusive, because both contribute to building a global interpretation. While corpora are large databases of the most frequent linguistic utterances (McEnery and Wilson, 1996), test suites are controlled and exhaustive databases of linguistic utterances classified by linguistic features. These collections of cases are internally organized and richly annotated (Lehmann et al., 1996). Their controlled, exhaustive and detailed nature allows these databases to provide qualitatively analyzed data.
Test suites were developed in parallel with NLP technologies. The more sophisticated the software became, the more complex the test suites became (Lehmann et al., 1996). From collections of interesting examples, they evolved into deeply structured and richly annotated databases (Table 1), such as the HP test suite (Flickinger et al., 1987), the test suite developed by one of the EAGLES groups (EAGLES, 1994), the TSNLP (Lehmann et al., 1996) and the corpus of unbounded dependencies (Rimell et al., 2009).
Concerning the languages of this study, a test suite for Spanish was developed by Marimon et al. (2007). The goal of that test suite is to assess the development of a Spanish Head-Driven Phrase Structure Grammar, and it offers both grammatical and ungrammatical test cases.

The ParTes test suite
This test suite is a hierarchically structured and richly annotated set of syntactic phenomena for qualitative parsing evaluation. It is available in Spanish (ParTesEs) and Catalan (ParTesCa) and is freely distributed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. The new release of ParTes (15.02) improves the linguistic data sets. Initially, ParTes included a test data module formed by sentences illustrating the syntactic phenomena of the test suite (Lloberes et al., 2014). The current version incorporates a set of linguistic data for development purposes, which extends the capabilities of the test suite by making it possible to monitor parser development and to run a second iteration of the evaluation task.
This resource has been created following the main contributions in test suite design (Flickinger et al., 1987; EAGLES, 1994; Lehmann et al., 1996). The main feature shared with existing test suites is the control over the data, which makes it possible to use the resource as a qualitative evaluation tool. Furthermore, ParTes adds the concepts of complexity of the resource organization, exhaustiveness of the phenomena descriptions and representativeness of the phenomena included.
ParTes is a test suite of syntactic phenomena annotated with syntactic and meta-linguistic information. The content has been hierarchically structured by means of syntactic features and over two major syntactic concepts (Figures 1 and 2): structure and word order.

Test suite specifications
The current version contains a total of 161 syntactic phenomena in ParTesEs (99 relate to syntactic structure and 62 to word order) and a total of 145 syntactic phenomena in ParTesCa (99 relate to syntactic structure and 46 to word order).
The structure phenomena have been manually collected from descriptive grammars (Bosque and Demonte, 1999; Solà et al., 2002) and represented following the criteria of the FDGs (Lloberes et al., 2010). The selection of phenomena has been validated against the frequency of dependency links in the AnCora Corpus (Taulé et al., 2008).
As Figure 1 shows, the first level of the hierarchy determines the level of the syntactic phenomenon (inside a chunk or between a marker and the subordinate verb). The second level expresses the phrase or the clause involved in the syntactic phenomenon (constituent) and the third level describes the position (head or child) in the hierarchy. Finally, a set of syntactic features describes the type of constituent observed (realization).
For every syntactic structure phenomenon, two linguistic examples have been manually defined, one of them to be used for development purposes (devel) and the other one for testing purposes (test). The lemmas of the parent and the child of the exemplified phenomenon are also provided (parent devel, parent test, child devel, child test).
Word order in ParTes is semi-automatically built from the most frequent argument structure frames of the SenSem Corpus (Fernández and Vàzquez, 2014).
The word order hierarchy is structured first by the number and the type of arguments of the word order schema (class), as Figure 2 illustrates. Every class is defined by a set of schemas specifying the number of arguments and their order. The most concrete level (realization) describes the properties of the schema.
These properties refer to the syntactic function (func) and the grammatical category (cat) of every argument of the schema. Furthermore, the type of construction in which the schema occurs (constr) and the type of subject (sbjtype) are provided. The occurrence frequency of the schema in the SenSem Corpus is also recorded (freq). In addition, a numeric id is assigned to every schema, and a link to the SenSem Corpus sentences with the same schema is created (idsensem).
Every schema recorded is exemplified with a sentence for testing purposes (test). For every test sentence, the lemmas of the parent and the children corresponding to the head of the arguments of the schema are added.
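The annotation fields described above can be pictured as a small record type. The following Python sketch is only an illustration under stated assumptions: the field names mirror the labels given in the text (class, func, cat, constr, sbjtype, freq, id, idsensem, test), but the class names, the helper method and all sample values are hypothetical and do not reproduce the actual ParTes file layout or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Argument:
    func: str  # syntactic function of the argument, e.g. "subj"
    cat: str   # grammatical category of the argument, e.g. "np"


@dataclass
class WordOrderSchema:
    schema_id: int             # numeric id of the schema
    schema_class: str          # class: number and type of arguments
    arguments: List[Argument]  # arguments in their surface order
    constr: str                # type of construction
    sbjtype: str               # type of subject
    freq: float                # occurrence frequency in the SenSem Corpus
    idsensem: List[int]        # links to SenSem sentences with this schema
    test: str                  # test sentence exemplifying the schema
    parent: str                # lemma of the parent
    children: List[str]        # lemmas of the heads of the arguments

    def order(self) -> str:
        """Return the argument functions in order, joined by '#'."""
        return "#".join(arg.func for arg in self.arguments)


# Hypothetical record; every value below is invented for illustration.
example = WordOrderSchema(
    schema_id=1,
    schema_class="ditransitive",
    arguments=[Argument("subj", "np"), Argument("dobj", "np"), Argument("iobj", "pp")],
    constr="active",
    sbjtype="full",
    freq=0.05,
    idsensem=[101, 202],
    test="La niña da un libro a su hermano",
    parent="dar",
    children=["niña", "libro", "hermano"],
)
print(example.order())
```

Keeping the arguments as an ordered list makes the word order itself part of the record, so two schemas with the same arguments in a different order remain distinct entries.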

Description of the data sets
The development and the test data are built on the manually defined linguistic examples of the syntactic phenomena of ParTes.
The sentences have been automatically annotated using the FDGs, so that a complete dependency analysis of the whole sentence is offered. The output has been manually reviewed by two annotators: a native speaker of Spanish, responsible for the annotation of ParTesEs, and a native speaker of Catalan, who annotated ParTesCa. A second manual revision has then been performed, in which the Catalan annotator reviewed the annotated ParTesEs and the Spanish annotator reviewed the annotated ParTesCa. This guarantees agreement between the annotations in both languages and preserves the quality of the annotation according to the criteria.
In the current version, the number of sentences referring to syntactic structure is as follows: 95 sentences in the ParTesEs development data set, 99 sentences in the ParTesEs test data set, 98 sentences in the ParTesCa development data set and 99 sentences in the ParTesCa test data set. The data sets are distributed in plain text format and in the CoNLL annotation format (Nivre et al., 2007).
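As a rough illustration of how data in the CoNLL format mentioned above might be loaded, the following Python sketch reads the ten tab-separated columns of the CoNLL-X layout, with blank lines separating sentences. The function name `read_conll`, the constant names and the sample sentence with its tags and labels are assumptions made for this example; they are not taken from the ParTes distribution.

```python
from typing import Dict, List

# Column names of the CoNLL-X format, in order.
CONLL_X_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]


def read_conll(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-X text into a list of sentences (lists of token dicts)."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLL_X_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        current.append(dict(zip(CONLL_X_FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample: tags and labels are invented for illustration.
SAMPLE = (
    "1\tEl\tel\td\td\t_\t2\tspec\t_\t_\n"
    "2\tgato\tgato\tn\tn\t_\t3\tsubj\t_\t_\n"
    "3\tduerme\tdormir\tv\tv\t_\t0\ttop\t_\t_\n"
)

sentences = read_conll(SAMPLE)
print(len(sentences), [tok["deprel"] for tok in sentences[0]])
```

All values are kept as strings here; a caller that needs numeric heads can convert the `head` field with `int` after loading.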

Evaluation task
In order to test the usability of ParTes for parsing evaluation, it has been applied as a gold standard in an evaluation task of the FDGs. In particular, the test suite has been used to explain the performance of the FDGs with regard to argument recognition, which remains an unsolved problem (Carroll et al., 1998; Zeman, 2002; Mirroshandel et al., 2013).
The FDGs are the core part of the rule-based FreeLing Dependency Parser (Padró and Stanilovsky, 2012). They provide a deep and complete syntactic analysis in the form of dependencies. The grammars are a set of manually defined rules that complete the structure of the tree (linking rules) and assign a syntactic function to every link of the tree (labelling rules) by means of a system of priorities and a set of conditions. Two versions of the FDGs have been evaluated for both languages: a version without verb subcategorization classes (Bare) and a version with verb subcategorization classes (Subcat) (Fernández and Vàzquez, 2014). The system analysis built for every version of the grammars is compared to the ParTes analysis using the evaluation metrics of the CoNLL-X Shared Task (Nivre et al., 2007). 3
According to the accuracy results (Table 2), the evaluation with ParTes shows that the FDGs reach medium accuracy (near or above 80% in LAS). Both versions of the grammar in both languages reach high accuracy in terms of attachment (UAS), whereas they obtain medium accuracy on syntactic function labelling (LA). The ParTes data highlight that the Subcat grammar scores better than the Bare grammar in LA, which is directly related to the addition of subcategorization classes, as the following discussion shows.
A detailed observation reveals that Subcat achieves higher precision than Bare on the ParTes sentences related to subcategorization (Table 3). Furthermore, the test data show that subcategorization has more impact on the recognition of the majority of arguments (dobj, pobj, pred) and of the subject (subj) than on the adjuncts (adjt), because the increase in precision is higher. Subcategorization does not have an effect on the attribute (attr) because it can be resolved lexically. The indirect objects (iobj) correspond to cases of dative clitics, which are resolved by morphological information.
The integration of subcategorization information binds the rules to the verbs included in the classes. Consequently, some cases may not be captured if the verb is not covered by the subcategorization classes, as happens with the prepositional object (pobj). For example, the prepositional argument of the sentence 'Ha creído en sí mismo' ('He has believed in himself') should be labelled as pobj, but the adjt tag is assigned because the verb 'creer' is not in any of the prepositional argument classes of the grammar. However, for the majority of argument types and for the adjuncts, recall is maintained or increased (Table 4).
3 Labeled Attachment Score (LAS): the percentage of tokens with correct head and syntactic function label; Unlabeled Attachment Score (UAS): the percentage of tokens with correct head; Label Accuracy (LA): the percentage of tokens with correct syntactic function label; Precision (P): the ratio between the system correct tokens and the system tokens; Recall (R): the ratio between the system correct tokens and the gold standard tokens.
Table 4: Recall scores of FDG on ParTes
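The metrics used in this section (LAS, UAS, LA, precision and recall) can be computed with a few lines of code. The sketch below is a minimal illustration and not the official CoNLL-X evaluation script. It assumes that each token is represented as a (head, label) pair, that gold and system analyses are aligned token by token, and that a token counts as correct for per-label precision and recall only when both its head and its label match the gold standard. The function names and the toy data are hypothetical; the toy data loosely mirror the pobj/adjt confusion discussed above, with invented head indices.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, str]  # (head index, syntactic function label)


def attachment_scores(gold: List[Token], system: List[Token]) -> Dict[str, float]:
    """Return LAS, UAS and LA as fractions over aligned tokens."""
    if len(gold) != len(system) or not gold:
        raise ValueError("gold and system must be non-empty and aligned")
    total = len(gold)
    uas = sum(g[0] == s[0] for g, s in zip(gold, system))  # correct head
    la = sum(g[1] == s[1] for g, s in zip(gold, system))   # correct label
    las = sum(g == s for g, s in zip(gold, system))        # both correct
    return {"LAS": las / total, "UAS": uas / total, "LA": la / total}


def precision_recall(gold: List[Token], system: List[Token], label: str) -> Tuple[float, float]:
    """Precision and recall for one label (0.0 when the denominator is zero)."""
    correct = sum(g == s and s[1] == label for g, s in zip(gold, system))
    n_system = sum(s[1] == label for s in system)
    n_gold = sum(g[1] == label for g in gold)
    precision = correct / n_system if n_system else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Toy analysis: the system attaches every token correctly
# but labels a gold pobj as adjt.
gold = [(2, "subj"), (0, "top"), (2, "pobj")]
system = [(2, "subj"), (0, "top"), (2, "adjt")]

scores = attachment_scores(gold, system)
print(scores)
print(precision_recall(gold, system, "pobj"))
print(precision_recall(gold, system, "subj"))
```

In this toy case the attachment is fully correct while one label is wrong, so UAS stays at 1.0 and both LAS and LA drop, which is the pattern of high attachment accuracy and lower labelling accuracy reported for the FDGs.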

Conclusions
The new version of the ParTes test suite for parsing evaluation has been presented. The main features and the data sets have been described. In addition, the results of an evaluation task of the FDGs with ParTes data have been presented. The characteristics of the test suite made it possible to analyze in detail the causes of the improvement in argument recognition achieved by the FDGs that include subcategorization information. These results therefore show that ParTes is an appropriate resource for parsing evaluation.
Currently, ParTes is being extended to English following the methodology explained in this paper. In upcoming releases, test and development sentences for word order will be incorporated into the ParTes data sets. Furthermore, we are exploring a systematic methodology to generate ungrammatical variants of the existing sentences.