Coreference Resolution for the Basque Language with BART

In this paper we present our work on Coreference Resolution in Basque, a unique language which poses interesting challenges for the problem of coreference. We explain how we extend the coreference resolution toolkit BART in order to enable it to process Basque. We then run four different experiments showing both a significant improvement obtained by extending a baseline feature set and the effect of measuring performance on hand-parsed mentions vs. automatically parsed mentions. Finally, we discuss some key characteristics of Basque which make it particularly challenging for coreference and draw a road map for future work.


Introduction
Basque is a language spoken by nearly three quarters of a million people, most of whom live in the Basque country, a region spanning parts of northern Spain and southwestern France. One of the most surprising findings about the Basque language is that it cannot be linked with any of its Indo-European neighbours in Europe and, hence, has been classified as a language isolate. It differs considerably in grammar from the languages spoken in surrounding regions. It is an agglutinative, head-final, pro-drop, free-word-order language (Laka, 1996).
Naturally, the Basque language has also inspired a lot of work in Computational Linguistics, with tools for automatically processing it becoming increasingly available (Alegria et al., 1996; Alegria et al., 2002; Alegria et al., 2003; Aduriz and Díaz de Ilarraza, 2003; Alegria et al., 2008). However, as is the case with most less-resourced languages, tools exist for the core processing levels, such as tokenisation, sentence splitting, morphological analysis and syntactic parsing/chunking, but much less so for the higher semantic levels required in end-goal applications such as Question Answering (Morton, 2000), Text Summarisation (Steinberger et al., 2007) or Information Extraction (Def, 1995; Hirschman, 1998). One such intermediate problem which has been under-researched for Basque, and for which no readily usable tools are publicly available yet, is that of Coreference Resolution (Poesio et al., 2016).
However, preliminary work on Coreference for Basque is starting to emerge (Soraluze et al., 2015), and in this paper we describe our work on extending the coreference resolution toolkit BART (Versley et al., 2008) to the Basque language. BART benefits from an open architecture and provides a language-plugin mechanism which makes it particularly suitable for adaptation to new languages, and it attained good performance in the shared task on Multilingual Coreference at CoNLL 2012 (Uryupina et al., 2012).
For our experiments we use the EPEC corpus annotated for coreference (Aduriz et al., 2006) and we run experiments across two dimensions. First, we use a baseline model based on (Soon et al., 2001) vs. a model that includes extra features reliably extracted for Basque with the tools at hand. Second, we measure performance on hand-parsed mentions vs. performance on automatically parsed mentions which illustrates the effect of pre-processing quality on the end results.
One of the key challenges that the Basque language introduces for Coreference is that it uses a genderless system for pronouns. In our experiments we look in more depth around this issue and show the challenges it presents as well as suggest viable solutions to model it with machine learning techniques.
The remainder of this paper is organised as follows: Section 2 briefly surveys related work, Section 3 gives details of EPEC, the coreference corpus we use, Section 4 describes the extension of BART to Basque, Section 5 presents results and discusses the challenges of coreference in Basque, and the final section draws conclusions and pointers to future work.

Related Work
Preliminary work on Coreference for Basque was done by (Soraluze et al., 2015), who adapt the Stanford coreference resolution system (Lee et al., 2013) to Basque. There has also been a lot of work on extending the BART coreference toolkit to languages other than English: it has been extended to Italian using the Evalita corpus of Wikipedia articles; (Broscheit et al., 2010) work on German using the TüBa-D/Z coreference corpus; (Kopeć and Ogrodniczuk, 2012) develop the Polish plug-in using a subset of the National Corpus of Polish; and finally (Uryupina et al., 2012) run experiments on Arabic and Chinese.

Annotated Corpus of Basque
EPEC (Reference Corpus for the Processing of Basque) (Aduriz et al., 2006) is a 300,000-word sample collection of standard written Basque that has been manually annotated at different levels (morphology, surface syntax, phrases, etc.). The corpus is composed of news published in Euskaldunon Egunkaria, a Basque-language newspaper. It is intended as a reference corpus for the development and improvement of several NLP tools for Basque.
Recently, mentions and coreference chains were also annotated by two linguists in a subset of the EPEC corpus comprising about 45,000 words. First, the mentions obtained automatically by our mention detector were corrected; then, coreferent mentions were linked into clusters. The mention detector is a set of hand-crafted rules that have been compiled into Finite State Transducers (FSTs). The FSTs match chunks and clauses provided by the preprocessing tools and identify the mentions and their boundaries. Further discussion of the FSTs' behaviour can be found in (Soraluze et al., 2012).
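To make the idea concrete, the rule-based detection over chunker output can be sketched as follows. This is not the actual FST cascade of (Soraluze et al., 2012), only a simplified stand-in in which every nominal chunk or named entity becomes a candidate mention; the chunk labels are hypothetical.

```python
# Simplified sketch of rule-based mention detection over chunker output.
# Real detection is done by FSTs over chunks and clauses; here we merely
# promote nominal chunks and named entities to candidate mentions.

def detect_mentions(chunks):
    """chunks: list of (label, start, end) triples from a chunker."""
    mentions = []
    for label, start, end in chunks:
        if label in ("NP", "ENTITY"):  # nominal chunks and named entities
            mentions.append((start, end))
    return mentions

chunks = [("NP", 0, 12), ("VP", 13, 20), ("ENTITY", 21, 31)]
print(detect_mentions(chunks))  # [(0, 12), (21, 31)]
```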
The whole annotation process was carried out using the MMAX2 annotation tool (Müller and Strube, 2006). The coreference annotation of the EPEC corpus is explained in more detail in (Ceberio et al., 2016).
To adapt BART to Basque, we divided the dataset into three parts: one for training the system, one for tuning, and one for testing. More detailed information about the three parts can be found in Table 1.

    Train    23520    6525    1011    3401
    Devel     6914    1907     302     982
    Test     15949    4360     621    2445

Table 1: EPEC-coref corpus division information.

Extending BART to Basque
BART was originally created for English, but its flexible modular architecture ensures its portability to other languages. BART consists of five main components: preprocessing pipeline, mention factory, feature extraction module, decoder and encoder. Furthermore, an additional independent Language Plugin module handles language specific information and is accessible from any component.
In the adaptation process of BART, we used a preprocessing pipeline of Basque linguistic processors, developed the Basque Language Plugin and added new features for coreference resolution specifically geared towards Basque.

Preprocessing and Mention Detection
The preprocessing pipeline takes raw text and applies a series of Basque linguistic processors to analyse it: i) a morphological analyser that performs word segmentation and PoS tagging (Alegria et al., 1996); ii) a lemmatiser that resolves the ambiguity introduced at the previous phase (Alegria et al., 2002); iii) a multi-word item identifier that determines which groups of two or more words are to be considered multi-word expressions (Alegria et al., 2004); iv) a named-entity recogniser that identifies and classifies named entities (person, organisation, location) in the text (Alegria et al., 2003); v) a chunker, an analyser that identifies verbal and nominal chunks based on rule-based grammars (Aduriz and Díaz de Ilarraza, 2003); vi) a clause tagger, that is, an analyser that identifies clauses, combining rule-based grammars and machine learning techniques (Alegria et al., 2008).
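The cascade above can be viewed schematically as a sequence of stages that each enrich a shared analysis object. The sketch below is purely illustrative: the stage names are placeholders, not the interfaces of the actual Basque processors.

```python
from functools import reduce

# Placeholder stages standing in for the real Basque processors listed
# above; each stage records that it has run and would, in reality, add
# its layer of linguistic annotation to the analysis.
def stage(name):
    def run(analysis):
        analysis["stages"].append(name)
        return analysis
    return run

PIPELINE = [stage(s) for s in (
    "morphology", "lemmatisation", "multiword", "ner", "chunking", "clauses")]

def preprocess(raw_text):
    return reduce(lambda a, f: f(a), PIPELINE, {"text": raw_text, "stages": []})

out = preprocess("Lehendakaria Bilbora joan da.")
print(out["stages"])
# ['morphology', 'lemmatisation', 'multiword', 'ner', 'chunking', 'clauses']
```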
After the preprocessing step, mentions that are potential candidates to be part of coreference chains are identified using the mention detector explained in Section 3.
Finally, the linguistic information obtained by the preprocessing tools and the mentions identified by the mention detector are stored in the stand-off format of the MMAX2 annotation tool (Müller and Strube, 2006), which BART uses.

Basque Language Plugin
Developing a Basque language plugin for BART involved building on the system's existing language plugins, translating closed-class words such as pronouns, mapping key part-of-speech tags, and adapting lower-level heuristics for finding the head noun of noun phrases and for person and number identification, as well as reading features made available by the preprocessing tools.
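One such lower-level heuristic, head finding, can be sketched as follows. Since in Basque noun phrases adjectives and determiners follow the noun, a plausible simplification is to take the last noun-tagged token as the head; this is our illustrative approximation, not BART's actual implementation.

```python
# Hedged sketch of a head-finding heuristic for Basque noun phrases:
# take the LAST noun-tagged token as the head, falling back to the
# final token when the phrase contains no noun tag.
def find_head(np_tokens):
    """np_tokens: list of (word, pos) pairs for one noun phrase."""
    for word, pos in reversed(np_tokens):
        if pos == "NOUN":
            return word
    return np_tokens[-1][0]  # fall back to the final token

# "etxe zuria" ("the white house"): the noun precedes the adjective.
print(find_head([("etxe", "NOUN"), ("zuria", "ADJ")]))  # etxe
```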

Feature engineering for Basque
All the features implemented in BART rely on linguistic information about the mentions. For languages BART supports, the MentionFactory computes these properties; for a new language such as Basque, they must be provided as part of the mention representation computed by the external preprocessing tools. We therefore added to the MMAX2 files features relevant for coreference resolution in Basque, such as number and lemma.
For our experiments, we trained BART with two different models. The first is a simple model following (Soon et al., 2001). The second is an improved version of the first, to which more Basque-oriented features have been added. The features used in each model are presented in Table 2.
In both models, gender agreement does not bring any improvement in the scores, as Basque is genderless. The new features proposed to handle the specificities of Basque are not novel in themselves and have also been used for other languages (see (Poesio et al., 2016) for details).
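A pairwise model in the spirit of (Soon et al., 2001) classifies mention pairs from feature vectors. The sketch below illustrates a few such features, including a lemma-match feature of the kind useful for agglutinative Basque; the mention fields and values are hypothetical and do not reflect BART's actual API.

```python
# Illustrative pairwise feature extraction for a Soon-style model.
# Each mention is a dict of (assumed) preprocessed attributes.
def pair_features(anaphor, antecedent):
    return {
        "string_match": anaphor["form"] == antecedent["form"],
        "lemma_match": anaphor["lemma"] == antecedent["lemma"],
        "number_agree": anaphor["number"] == antecedent["number"],
        "sent_dist": anaphor["sent"] - antecedent["sent"],
    }

m1 = {"form": "lehendakaria", "lemma": "lehendakari", "number": "sg", "sent": 0}
m2 = {"form": "lehendakariari", "lemma": "lehendakari", "number": "sg", "sent": 2}
print(pair_features(m2, m1))
# surface forms differ, but the lemmas match across inflections
```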

Experimental Results
We tested the two models presented in Subsection 4.3 in two different settings: in the first, automatically detected mentions are provided to the models; in the second, the mentions are gold. The metrics used in our evaluations are MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), CEAFe (Luo, 2005), CEAFm (Luo, 2005), and BLANC (Recasens and Hovy, 2011). The scores have been calculated using the reference implementation of the CoNLL scorer (Pradhan et al., 2014). When gold mentions are used, the Basque model outperforms the Soon baseline according to all the metrics except B³; on the official CoNLL metric, it outperforms the baseline by 5.61 points.
Comparing the results obtained with gold mentions to those obtained with automatic mentions, there is a considerable difference. The CoNLL F1 of the Soon baseline is 50.05 when automatic mentions are provided, while with gold mentions this value rises to 66.81, an increase of 16.76 points. (The CoNLL metric is the arithmetic mean of the MUC, B³ and CEAFe metrics.) A similar increase in CoNLL F1 occurs with the Basque model: in this case, 18.70 points, from 53.72 with automatic mentions to 72.42 with gold mentions.
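The CoNLL score and the reported differences follow directly from simple arithmetic, which can be spelled out as:

```python
# CoNLL score: arithmetic mean of the MUC, B³ and CEAFe F1 scores.
def conll_f1(muc, b3, ceafe):
    return (muc + b3 + ceafe) / 3.0

# The gold-vs-automatic gaps reported in the text:
print(round(66.81 - 50.05, 2))  # 16.76 (Soon baseline)
print(round(72.42 - 53.72, 2))  # 18.7  (Basque model)
```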
We also examined pronoun resolution performance on its own, though only via MUC scores on automatic mentions, as the CoNLL scorer does not provide a breakdown of scores per anaphor type. There was a small gain from the Soon baseline (F1 = 27.4) to the Basque model (F1 = 33.0). The gain is due mostly to higher precision, suggesting that the additional features in the Basque model help filter out pronouns erroneously resolved by the baseline. However, more work will need to be devoted to improving recall, which is particularly challenging in the case of Basque due to the lack of gender in its pronoun system.

Error Analysis
In our error analysis we examined examples from our corpus covering the following four cases. Case a: errors in coreference resolution due to errors in the pre-processing which were propagated across the pipeline. Consider example (1), for instance:

(1)

Discussion
Taking into consideration Basque's most relevant grammatical characteristics, in some respects it is more challenging to resolve coreference in this language than in others. Since Basque is an agglutinative language, a given lemma takes many different word forms, depending on the case (genitive, locative, etc.) or the number (singular, plural, indefinite) for nouns and adjectives. For example, the lemma lehendakari ("president") yields the inflected forms lehendakaria ("the president"), lehendakariak ("the president"), lehendakariari ("to the president"), lehendakariei ("to the presidents"), lehendakariaren ("of the president"), etc. This means that looking only for the exact word form is not enough to resolve coreference in Basque when string-matching techniques are applied; as we observed in our experiments, the use of lemmas is more effective in morphologically rich languages.
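The contrast between exact string match and lemma-level match can be illustrated with the inflections cited above. The mini-lexicon below is a hypothetical stand-in for the lemmatiser's output and only covers these forms.

```python
# Hypothetical lemma lookup covering only the inflections cited in the
# text; a real system would obtain lemmas from the lemmatiser.
LEMMA = {
    "lehendakaria": "lehendakari",
    "lehendakariak": "lehendakari",
    "lehendakariari": "lehendakari",
    "lehendakariei": "lehendakari",
    "lehendakariaren": "lehendakari",
}

def exact_match(a, b):
    return a == b

def lemma_match(a, b):
    return LEMMA.get(a, a) == LEMMA.get(b, b)

print(exact_match("lehendakaria", "lehendakariari"))  # False
print(lemma_match("lehendakaria", "lehendakariari"))  # True
```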
Besides the agglutination, there is no grammatical gender in the nominal system. Nouns and adjectives have no distinct endings depending on gender. In addition, there are no distinct forms for third person pronouns in Basque, and demonstratives are used as third person pronominals (Laka, 1996).
This makes it impossible to use gender as a feature in the resolution process, a feature which has proven particularly useful in the resolution of pronouns, for example. Furthermore, the animacy feature cannot be used for pronoun resolution either. In this scenario, distance-based features such as sentence distance and markable distance may be the most effective features for pronoun resolution. Nevertheless, research will have to be devoted to finding other useful features to make up for the lack of gender and animacy.
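The distance-based features just mentioned can be sketched as follows, assuming each mention carries a sentence index and a running markable index; both field names are hypothetical.

```python
# Sentence distance: how many sentences separate anaphor and antecedent.
def sentence_distance(anaphor, antecedent):
    return anaphor["sent"] - antecedent["sent"]

# Markable distance: how many mentions lie between the pair in document
# order, approximated by the difference of running mention indices.
def markable_distance(anaphor, antecedent):
    return anaphor["idx"] - antecedent["idx"]

ana = {"sent": 5, "idx": 40}
ante = {"sent": 3, "idx": 31}
print(sentence_distance(ana, ante))  # 2
print(markable_distance(ana, ante))  # 9
```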

Conclusion
In this paper we presented our ongoing work on Coreference Resolution in Basque. We described the main resource we have been using, the EPEC corpus annotated with coreference, and we explained how we have been adapting the coreference resolution toolkit BART to enable it to process Basque. We ran experiments in two settings, one resolving coreference using gold mentions and one using automatically parsed mentions, and for each we trained two different models: a baseline model based on (Soon et al., 2001) and a Basque model with an extended feature set. We showed that the Basque model significantly outperforms the baseline. We also discussed key characteristics of the Basque language which make it particularly challenging for coreference.
Next we plan to investigate in more depth suitable features that can both make up for the lack of gender and animacy and be extracted reliably from unrestricted text. We also plan to run an extrinsic evaluation gauging the effect of coreference resolution on a higher-level task.