Improving Polish Mention Detection with Valency Dictionary

This paper presents the results of an experiment integrating information from a valency dictionary of Polish into a mention detection system. Two types of information are acquired: positions of syntactic schemata for nominal and verbal constructs, and secondary prepositions present in the schemata. The syntactic schemata are used to prevent (for verbal realizations) or encourage (for nominal groups) the construction of mentions from phrases filling multiple schema positions, while the secondary prepositions are used to filter out artificial mentions created from their nominal components. Mention detection is evaluated against the manual annotation of the Polish Coreference Corpus in two settings: taking into account only mention heads or exact borders.


Introduction
Coreference resolution systems are believed to suffer from a lack of integration of "deeper" knowledge, with respect to both semantics and world knowledge. It has been recognized from the very beginning (Hobbs, 1978) that these are very important and at the same time difficult factors in the process, yet present attempts to integrate such features bring only small improvements to the overall accuracy (see the next section for examples). The slow progress in solving complex semantics- or knowledge-related issues is prompting a shift towards the search for new algorithms and models, and probably also adds to the general loss of interest in coreference resolution.
Nevertheless, we argue that this situation should not be considered a failure of semantic approaches but rather a consequence of the enormous size and complexity of the knowledge system that needs to be applied to linguistic processing, including reference decoding. On the contrary, we believe that the method of small steps towards the big goal is constantly bringing useful models and resources to the field, growing year by year in size and complexity. This is particularly important for languages other than English, where more subtle properties of semantic constructs can influence the results.
In the current paper we show how the integration of a relatively simple rule taking into consideration verbal and nominal valency in Polish slightly but consistently improves mention detection scores.

Related Work
Kehler et al. (2004) integrated preferences inferred from statistics of subject-verb, verb-object and possessive-noun predicate-argument frequencies into a pronoun resolution system, which resulted in a 1% accuracy improvement. Several works integrating semantic processing into coreference resolution were also proposed. Ponzetto and Strube (2006b) integrated predicate-argument pairs into the resolution system of Soon et al. (2001), which yielded a 1.5-point MUC F1 score improvement on ACE 2003 data. Ponzetto and Strube (2006a; 2007) used Wikipedia, WordNet and semantic role tagging to compute semantic relatedness between anaphor and antecedent, achieving a 2.7-point MUC F1 score improvement on ACE 2003 data. Rahman and Ng (2011) labelled nominal phrases with FrameNet semantic roles, achieving a 0.5-point B3 and CEAF F1 score improvement, and used YAGO type and means relations, achieving 0.7 to 2.8 points of improvement on OntoNotes-2 and ACE 2004/2005 data. Durrett and Klein (2013) incorporated shallow semantics into their system by using WordNet hypernymy and synonymy, number and gender data for nominals and propers, named entity types and latent clusters computed from the English Gigaword corpus, reaching 1.6 points of improvement on gold data and 0.36 points on system data.
For Polish, WordNet and Wikipedia-related features were used to improve verification of semantic compatibility for common nouns and named entities in the BARTEK-3 coreference resolution system (Ogrodniczuk et al., 2015, Section 12.3), resulting in an improvement of approx. 0.5 points in MUC F1 score. Experiments have also been performed with the integration of external vocabulary resources coming from websites registering the newest linguistic trends in Polish, fresh loan words and neologisms not yet covered by traditional dictionaries; they showed low coverage of new constructs in evaluation data (Ogrodniczuk, 2013).
All these results showed challenges regarding knowledge-based resources, mainly concerning the memory and time complexity of the task as well as low coverage of complex features in the test data, but at the same time brought some (sometimes tiny) improvements to coreference resolution scores.

Problem Definition
In our approach mentions are defined as text fragments (nominal groups including attached prepositional phrases and relative clauses) which could potentially create references to discourse world objects. This definition has both syntactic and semantic grounds: including extensive syntactically dependent phrases within mention borders is important for the semantic understanding of mentions: pierwszy człowiek na Księżycu 'the first man on the Moon' or samochód, który potrącił moją żonę 'the car which hit my wife' have different meanings than just człowiek 'the man' or samochód 'the car'. One of the consequences of this distinction is treating as mentions all embedded phrases with heads distinct from the head of the main phrase (meaning that they correspond to different entities). Therefore, in the example:
(1) szef działu firmy 'the head of the branch of the company'
three noun phrases should be considered mentions, referring respectively to 'the head of the branch of the company', 'the branch of the company' and 'the company' itself.
The need for exact mention border detection is at odds with the unavailability of a constituency parser for Polish with sufficient coverage, which could solve most of the attachment problems. The current state-of-the-art mention detector for Polish (see Section 4.3) identifies nominal groups with the relatively old Spejd shallow parser. Our work attempts to use valency schemata from a recently created valency dictionary for Polish (see Section 4.1) for two purposes: to prevent mention borders from crossing positions of a syntactic schema, and to filter out mentions created from nominal components of secondary prepositions, which are also present in the valency dictionary.

Walenty, a Polish Valence Dictionary
Walenty (Przepiórkowski et al., 2014) is a comprehensive human- and machine-readable dictionary of Polish valency information for verbs, nouns, adjectives and adverbs. It consists of two interconnected layers, syntactic and semantic, and features precise linguistic description, including the structural case, clausal subjects, complex prepositions, comparative constructions, control and raising, and semantically defined phrase types. Lexicon entries have a strictly defined formal structure, and the represented syntactic and semantic phenomena are always attested in linguistic reality, with the National Corpus of Polish (Przepiórkowski et al., 2012, later referred to as NKJP) as the primary source of data and the Internet and linguistic literature as secondary sources.
Each lexical entry is identified by its lemma and consists of a number of syntactic valence schemata, each schema being a set of syntactic positions. Apart from the two labeled argument positions, subject and object, the usual phrase types are considered, such as nominal phrases (NP), prepositional phrases (PREPNP), adjectival phrases (ADJP), clausal phrases (CP), etc. Phrase types can be further parameterised by corresponding grammatical categories, e.g., NP and ADJP are parameterised by information concerning case. The underscore symbol '_' denotes any value of a grammatical category, e.g., INFP(_) denotes an infinitival phrase of any aspect. Figure 1 presents a sample schema for the verb łączyć ('to link') with a subject, an object, a nominal phrase in the instrumental case and a prepositional phrase with the preposition z ('with') and a nominal component again in the instrumental case, as in the following example:
(2) Potężne [...]
As of January 2017, Walenty contains over 65K schemata for 12K Polish verbs and 16K schemata for about 2500 nouns, and is still expanding.
In our experiments we use Walenty in textual format (Hajnicz et al., 2015), which can be downloaded directly from the Slowal Web application (Nitoń et al., 2016). The version used in our experiment is dated January 17, 2017.
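The structure of a schema described in this section can be pictured with a small data model. The following is only a sketch: the class names, the upper-case case labels and the phrase types assumed for the subject and object positions are illustrative assumptions, and the encoding does not reproduce the actual Walenty textual format.

```python
from dataclasses import dataclass

WILDCARD = "_"  # any value of a grammatical category, as in INFP(_)


@dataclass(frozen=True)
class PhraseType:
    """A phrase type with its parameters, e.g. NP(INST) or PREPNP(z, INST)."""
    name: str
    params: tuple = ()

    def matches(self, other: "PhraseType") -> bool:
        """True if `other` (a phrase found in text) realizes this phrase type."""
        if self.name != other.name or len(self.params) != len(other.params):
            return False
        return all(p == WILDCARD or p == q
                   for p, q in zip(self.params, other.params))


@dataclass(frozen=True)
class Position:
    """One syntactic position: admissible phrase types and an optional label."""
    phrase_types: tuple
    label: str = ""  # "subj", "obj", or empty for unlabeled positions


# Hypothetical encoding of the schema described for łączyć ('to link').
# The phrase types of the subject and object positions are assumed here;
# the text only names the two remaining positions explicitly.
LACZYC_SCHEMA = (
    Position((PhraseType("NP", (WILDCARD,)),), label="subj"),
    Position((PhraseType("NP", (WILDCARD,)),), label="obj"),
    Position((PhraseType("NP", ("INST",)),)),
    Position((PhraseType("PREPNP", ("z", "INST")),)),
)

# A phrase detected in text: a prepositional phrase with z + instrumental.
detected = PhraseType("PREPNP", ("z", "INST"))
fitting = [i for i, pos in enumerate(LACZYC_SCHEMA)
           if any(pt.matches(detected) for pt in pos.phrase_types)]
```

In this toy encoding only the last position admits the detected phrase, so `fitting` contains the single index 3.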

Polish Coreference Corpus
The Polish Coreference Corpus (Ogrodniczuk et al., 2015) is a large corpus of Polish general nominal coreference built upon NKJP. Each text of the corpus is a 250-350-word sample consisting of full consecutive paragraphs extracted from a larger text. With its 1900 documents from 14 text genres, containing about 540K tokens, 180K mentions and 128K coreference clusters, the PCC is among the largest manually annotated coreference corpora in the international community.
Mentions in PCC are understood as broadly as possible, with the following components included in the nominal phrase:
Similarly, some phrases with a syntactic head other than nominal were also considered mentions, such as numeral phrases or coordinated nominal phrases.
The current version of PCC data is 0.92 dated December 29, 2014.

Mention Detector for Polish
The state-of-the-art mention detection tool for Polish is MentionDetector 6, which uses information from morphosyntactic, shallow syntactic and named entity annotations created with state-of-the-art tools for Polish. MentionDetector is mostly a rule-based tool with a statistical mechanism for detecting zero subjects. The following constructs are recognized:
1. single-segment nouns and nominal groups, detected with the Spejd shallow parser 7 (Przepiórkowski and Buczyński, 2007) fitted with an adaptation of the NKJP grammar of Polish (Ogrodniczuk et al., 2014)
2. pronouns, identified with the disambiguating morphosyntactic tagger Pantera 8 (Acedański, 2010) with the morphological analyser and lemmatizer Morfeusz 9 (Woliński, 2014)
3. zero subjects, detected with a custom solution (Kopeć, 2014)
4. nominal named entities, detected with Nerf 10 (Waszczuk et al., 2013).
The current version of MentionDetector is 1.3 dated October 13, 2016.

The Experiment
The idea for the experiment is based on the observation that delimitation of mentions based on their semantic understanding is different for nominal and verbal constructs: for nominal phrases engaged in valency schemata (making the mention 'core') all syntactic positions should be included within the mention boundaries, since they add vital supporting information to the core, while for verbal phrases their nominal or prepositional positions correspond to different semantic roles and cannot be linked into a single mention. This assumption is verified with schemata acquired from Walenty against the PCC gold annotation.
6 http://zil.ipipan.waw.pl/MentionDetector
7 http://zil.ipipan.waw.pl/Spejd
8 http://zil.ipipan.waw.pl/Pantera
9 http://sgjp.pl/morfeusz/index.html.en
10 http://zil.ipipan.waw.pl/Nerf
The entry point for both the nominal and the verbal part of the experiment is the same: finite verb forms as well as nominal and prepositional phrases are detected in the text and matched against valency schemata. This is achieved by comparing the base forms of syntactic heads with entries from the valency dictionary (directly for the main Walenty entry, and by creating textual representations of phrase types for syntactic positions).
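The matching step can be illustrated with a toy lookup. This is a minimal sketch under several assumptions: the dictionary is a plain Python mapping with a single invented entry, phrases are given as tuples, positions are paired with phrases greedily, and the schema with the most filled positions is preferred. None of these choices is taken from the actual implementation.

```python
# Toy stand-in for the valency dictionary: lemma -> list of schemata.
# Each schema is a list of positions; each position is the set of phrase
# types (in textual form) that may fill it. The entry is illustrative only.
TOY_DICTIONARY = {
    "łączyć": [[{"NP(INST)"}, {"PREPNP(z,INST)"}]],
}


def phrase_repr(kind, case, prep=None):
    """Textual representation of a detected phrase, e.g. 'PREPNP(z,INST)'."""
    if kind == "PREPNP":
        return f"PREPNP({prep},{case})"
    return f"{kind}({case})"


def filled_positions(schema, phrases):
    """Greedily pair schema positions with distinct phrases that realize them."""
    used, pairs = set(), []
    for pos_idx, admissible in enumerate(schema):
        for phr_idx, phrase in enumerate(phrases):
            if phr_idx not in used and phrase_repr(*phrase) in admissible:
                used.add(phr_idx)
                pairs.append((pos_idx, phr_idx))
                break
    return pairs


def best_schema(head_lemma, phrases, dictionary=TOY_DICTIONARY):
    """Return (schema, pairs) for the schema with most filled positions, or None."""
    best = None
    for schema in dictionary.get(head_lemma, []):
        pairs = filled_positions(schema, phrases)
        if pairs and (best is None or len(pairs) > len(best[1])):
            best = (schema, pairs)
    return best


# Phrases detected around a form of łączyć: (kind, case[, preposition]).
phrases = [("NP", "INST"), ("PREPNP", "INST", "z")]
match = best_schema("łączyć", phrases)
```

Here `match[1]` pairs position 0 with phrase 0 and position 1 with phrase 1; a lemma absent from the dictionary yields `None`.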

Nominal realizations
If a nominal schema with two positions corresponding to phrase types detected in the document is found, both the core nominal phrase and the dependent phrases are merged into a single mention, as in:
(3) Od tamtego czasu miał miejsce [...]
The results of mention detection after adding this rule to the base MentionDetector are presented in Table 1 under Mention merging.
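The merging rule can be sketched on token offsets. The function below is an illustration only: the span representation (start inclusive, end exclusive), the requirement that dependents follow the core contiguously, and the example phrase with its assumed schema are all assumptions rather than details of the actual system.

```python
def merge_nominal_mention(core, dependents):
    """Extend the core nominal span over dependents filling schema positions.

    Spans are (start, end) token offsets with `end` exclusive. Assumption of
    this sketch: only dependents directly following the material merged so
    far are attached; anything after a gap is left alone.
    """
    start, end = core
    for dep_start, dep_end in sorted(dependents):
        if dep_start != end:
            break  # gap between the merged span and the next dependent
        end = dep_end
    return (start, end)


# Invented example: walka rządu z korupcją
# 'the government's fight against corruption', assuming a nominal schema
# for walka with a genitive NP position and a PREPNP(z, INST) position.
tokens = ["walka", "rządu", "z", "korupcją"]
core = (0, 1)                    # walka
dependents = [(2, 4), (1, 2)]    # z korupcją, rządu
merged = merge_nominal_mention(core, dependents)
merged_text = " ".join(tokens[merged[0]:merged[1]])
```

With these inputs the merged span covers all four tokens, whereas a dependent separated from the core by other material leaves the core span unchanged.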

Verbal realizations
If a verbal schema with nominal or prepositional positions is detected in the document, we prevent the creation of a single mention out of phrases from different syntactic positions, cf. [...] [on their promotion] NP(GEN).'
The results of mention detection after adding this rule are presented in Table 1 under Mention cleaning; note that the nominal realizations rule is also active.
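The cleaning rule can be sketched as an overlap check between mention candidates and the spans of phrases filling different positions of a verbal schema. This is a simplified illustration: spans are token offsets with an exclusive end, the offsets are invented, and candidates crossing two positions are simply discarded, which is an assumption about how the prevention is realized.

```python
def overlaps(a, b):
    """True if two (start, end) spans with exclusive ends share a token."""
    return a[0] < b[1] and b[0] < a[1]


def clean_mentions(candidates, position_spans):
    """Keep only candidates that touch at most one schema position."""
    kept = []
    for mention in candidates:
        crossed = sum(overlaps(mention, pos) for pos in position_spans)
        if crossed <= 1:
            kept.append(mention)
    return kept


# Invented offsets: two phrases filling different positions of a verbal
# schema, e.g. an NP at tokens 2-3 and a PREPNP at tokens 4-6.
position_spans = [(2, 4), (4, 7)]
candidates = [(2, 4), (2, 7), (5, 7)]
cleaned = clean_mentions(candidates, position_spans)
```

The candidate (2, 7) spans both positions and is dropped, while (2, 4) and the embedded (5, 7) each stay within a single position and are kept.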

Secondary Prepositions and Phraseological Compounds
Another valuable piece of information present in Walenty is a list of approx. 200 secondary prepositions used in syntactic schemata (see http://walenty.ipipan.waw.pl/rozwiniecia_typow_fraz/). Since secondary prepositions are lexicalized combinations of primary (monomorphemic) prepositions and nominal or prepositional phrases, their nominal components are often automatically (and always incorrectly) marked as mentions. Table 1 under Walenty list presents the results of removing such mentions from the system set.
The next step was the expansion of the list of complex prepositions using other available sources, the first of them being The PWN Universal Dictionary of the Polish Language (Dubisz, 2006; electronic version: http://usjp.pwn.pl/). Secondly, rules responsible for building secondary prepositions out of individual prepositions and nouns in the Spejd grammar were examined, and their components were also excluded from the list of mention candidates. Last but not least, Spejd grammar rules for idiomatic expressions (marked as frazeo) were investigated to collect indeclinable phraseological units with a nominal component, such as:
• particle-adverbs (Qub), e.g. bez wątpienia 'without a doubt'
• adverbs (Adv), e.g. w lot 'immediately'
• interjections (Interj), e.g. broń Boże 'heaven forbid'
• adjectives (Adj), e.g. na poziomie 'ambitious'
• conjunctions (Conj), e.g. przy czym 'at the same time'
• compounds (Comp), e.g. w miarę jak (słuchali) 'as (they listened)'
Strings matching complex prepositions are therefore not always used as prepositions, and the wider context must be known to distinguish whether they are truly complex prepositions or constructions introducing a mention into the discourse. Spejd helps us in this distinction.
The results of mention detection after adding this rule are presented in Table 1 under Secondary prepositions; note that the nominal realizations and verbal realizations rules are also active.
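The filtering of nominal components can be sketched as follows. The three prepositions are common Polish complex prepositions chosen for illustration; whether each of them appears in the Walenty list is not verified here. The mention representation (start, end, head index) and the treatment of the last token as the nominal component are assumptions, and the plain string matching deliberately ignores the contextual disambiguation discussed above.

```python
# Illustrative list only; not extracted from Walenty.
SECONDARY_PREPOSITIONS = [("na", "temat"), ("w", "czasie"), ("w", "ramach")]


def preposition_noun_indices(tokens, prepositions):
    """Token indices of nominal components of matched prepositions.

    Assumption of this sketch: the nominal component is the last token
    of the preposition.
    """
    lowered = [t.lower() for t in tokens]
    blocked = set()
    for prep in prepositions:
        n = len(prep)
        for i in range(len(lowered) - n + 1):
            if tuple(lowered[i:i + n]) == tuple(prep):
                blocked.add(i + n - 1)
    return blocked


def filter_mentions(tokens, mentions, prepositions=SECONDARY_PREPOSITIONS):
    """Drop mentions headed by the nominal part of a secondary preposition.

    Mentions are (start, end, head) triples of token offsets, end exclusive.
    """
    blocked = preposition_noun_indices(tokens, prepositions)
    return [m for m in mentions if m[2] not in blocked]


# rozmowa na temat wyborów 'a conversation about the elections'
tokens = ["rozmowa", "na", "temat", "wyborów"]
mentions = [(0, 4, 0), (2, 4, 2), (3, 4, 3)]
filtered = filter_mentions(tokens, mentions)
```

The candidate headed by temat is removed, while the full phrase headed by rozmowa and the embedded wyborów are kept.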

Results
Mention detection is evaluated following the procedure described in Ogrodniczuk et al. (2015). Precision, recall and F-measure are calculated using the Scoreference application from the Polish Coreference Toolset. In contrast to the SemEval approach (Recasens et al., 2010), where systems were rewarded with 1 point for correct mention boundaries, 0.5 points for boundaries within the gold NP including its head, and 0 otherwise, in our evaluation we decided not to reward partial matches but to provide two alternative mention detection scores: EXACT boundary match and HEAD match. Table 1 compares the results of exact mention detection to the best available mention detection results for Polish. The baseline for our verification is the newest evaluation result of the current version of MentionDetector on PCC test data.
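The difference between the two scores can be made concrete with a small computation. This sketch does not reproduce Scoreference; it assumes mentions are (start, end, head) triples of token offsets, compares them as sets, and uses invented data.

```python
def prf(system, gold, key):
    """Precision, recall and F1 over mentions compared by `key`."""
    sys_keys = {key(m) for m in system}
    gold_keys = {key(m) for m in gold}
    correct = len(sys_keys & gold_keys)
    precision = correct / len(sys_keys) if sys_keys else 0.0
    recall = correct / len(gold_keys) if gold_keys else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def exact_key(mention):
    """EXACT setting: both boundaries must agree."""
    return (mention[0], mention[1])


def head_key(mention):
    """HEAD setting: only the head token must agree."""
    return mention[2]


# Invented mentions: (start, end, head), end exclusive.
gold = [(0, 4, 0), (3, 4, 3)]
system = [(0, 2, 0), (3, 4, 3), (2, 4, 2)]

exact_scores = prf(system, gold, exact_key)
head_scores = prf(system, gold, head_key)
```

The system mention (0, 2, 0) has the right head but the wrong right border, so it counts only in the HEAD setting: the EXACT scores are P = 1/3, R = 1/2, F1 = 0.4, while the HEAD scores are P = 2/3, R = 1, F1 = 0.8.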
The nominal realizations rule increases the mention detection score by over 1%. We believe that the gain could be larger with a larger dictionary: our rule uses noun constraints only, and so far there are only about 2500 nouns in Walenty. Walenty is still expanding, so further score improvement can be expected.
The verbal realizations rule brings a very small improvement in the mention detection score but is highly precise.
Head-only detection results are presented for comparison. They increase slightly after applying the secondary prepositions and phraseological compounds rule, because this step removes many wrong single-segment mentions (consisting of heads only), which has a noticeable positive impact on HEAD mention detection precision.

Conclusions
The presented experiment showed the usefulness of valency schemata in mention detection, although the scale of improvement was relatively small. This can be attributed to several factors, such as the limited size of the valency dictionary or the sparsity of cases where valency rules can intervene (as opposed to 'general' cases).
The setting used only the two most frequent types of phrases present in valency schemata, nominal and prepositional phrases, so one of the next steps could be an analysis of how other types of phrases intervene in the process of mention construction.
Even though the gains are far from huge compared to the progress introduced to the field in recent years by the adoption of new algorithms and architectures, experiments with integrating knowledge and semantics into the process seem worth pursuing, particularly for languages other than English, for which they may offer fine-tuning of language-independent solutions, bringing slow but stable progress to the results of linguistic analysis.