Itziar Gonzalez-Dios


2024

Automatic Detection and Labelling of Personal Data in Case Reports from the ECHR in Spanish: Evaluation of Two Different Annotation Approaches
Maria Sierro | Begoña Altuna | Itziar Gonzalez-Dios
Proceedings of the Workshop on Computational Approaches to Language Data Pseudonymization (CALD-pseudo 2024)

In this paper we evaluate two annotation approaches for automatic detection and labelling of personal information in legal texts in relation to the ambiguity of the labels and the homogeneity of the annotations. For this purpose, we built a corpus of 44 case reports from the European Court of Human Rights in Spanish and annotated it following two different annotation approaches: automatic projection of the annotations of an existing English corpus, and manual annotation with our reinterpretation of their guidelines. Moreover, we employ Flair on a Named Entity Recognition task to compare its performance in the two annotation schemes.

2023

Towards Effective Correction Methods Using WordNet Meronymy Relations
Javier Álvez | Itziar Gonzalez-Dios | German Rigau
Proceedings of the 12th Global Wordnet Conference

In this paper, we analyse and compare several correction methods for knowledge resources with the purpose of improving the abilities of systems that require commonsense reasoning with the least possible human effort. To this end, we cross-check the WordNet meronymy relation member against the knowledge encoded in a SUMO-based first-order logic ontology on the basis of the mapping between WordNet and SUMO. In particular, we focus on the knowledge in WordNet regarding the taxonomy of animals and plants. Despite being created manually, these knowledge resources (WordNet, SUMO and their mapping) are not free of errors and discrepancies. Thus, we propose three correction methods: semi-automatically improving the alignment between WordNet and SUMO, performing a few corrections in SUMO, and combining the two strategies. The evaluation of each method includes the required human effort and the achieved improvement on unseen data from the WebChild project, which is tested using first-order logic automated theorem provers.

This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models
Iker García-Ferrero | Begoña Altuna | Javier Alvez | Itziar Gonzalez-Dios | German Rigau
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Although large language models (LLMs) have apparently acquired a certain level of grammatical knowledge and the ability to make generalizations, they fail to interpret negation, a crucial step in Natural Language Processing. We try to clarify the reasons for the sub-optimal performance of LLMs in understanding negation. We introduce a large semi-automatically generated dataset of circa 400,000 descriptive sentences about commonsense knowledge that can be true or false, in which negation is present in about 2/3 of the corpus in different forms. We have used our dataset with the largest available open LLMs in a zero-shot approach to grasp their generalization and inference capability, and we have also fine-tuned some of the models to assess whether the understanding of negation can be trained. Our findings show that, while LLMs are proficient at classifying affirmative sentences, they struggle with negative sentences and lack a deep understanding of negation, often relying on superficial cues. Although fine-tuning the models on negative sentences improves their performance, the lack of generalization in handling negation is persistent, highlighting the ongoing challenges of LLMs regarding negation understanding and generalization. The dataset and code are publicly available.

2022

IrekiaLFes: a New Open Benchmark and Baseline Systems for Spanish Automatic Text Simplification
Itziar Gonzalez-Dios | Iker Gutiérrez-Fandiño | Oscar M. Cumbicus-Pineda | Aitor Soroa
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

Automatic Text Simplification (ATS) seeks to reduce the complexity of a text for a general public or a target audience. In recent years, deep learning methods have become the most used systems in ATS research, but these systems need large, good-quality datasets to be evaluated. Moreover, these data are available on a large scale only for English and in some cases with restrictive licenses. In this paper, we present IrekiaLF_es, an open-license benchmark for Spanish text simplification. It consists of a document-level corpus and a sentence-level test set that has been manually aligned. We also conduct a neurolinguistically-based evaluation of the corpus in order to reveal its suitability for text simplification. This evaluation follows the Lexicon-Unification-Linearity (LeULi) model of neurolinguistic complexity assessment. Finally, we present a set of experiments and baselines of ATS systems in a zero-shot scenario.

Textual Entailment for Event Argument Extraction: Zero- and Few-Shot with Multi-Source Learning
Oscar Sainz | Itziar Gonzalez-Dios | Oier Lopez de Lacalle | Bonan Min | Eneko Agirre
Findings of the Association for Computational Linguistics: NAACL 2022

Recent work has shown that NLP tasks such as Relation Extraction (RE) can be recast as Textual Entailment tasks using verbalizations, with strong performance in zero-shot and few-shot settings thanks to pre-trained entailment models. The fact that relations in current RE datasets are easily verbalized casts doubts on whether entailment would be effective in more complex tasks. In this work we show that entailment is also effective in Event Argument Extraction (EAE), reducing the need for manual annotation to 50% and 20% in ACE and WikiEvents, respectively, while achieving the same performance as with full training. More importantly, we show that recasting EAE as entailment alleviates the dependency on schemas, which has been a roadblock for transferring annotations between domains. Thanks to entailment, the multi-source transfer between ACE and WikiEvents further reduces annotation down to 10% and 5% (respectively) of the full training without transfer. Our analysis shows that the key to good results is the use of several entailment datasets to pre-train the entailment model. Similar to previous approaches, our method requires a small amount of effort for manual verbalization: less than 15 minutes per event argument type is needed, and comparable results can be achieved by users with different levels of expertise.

Patterns of Text Readability in Human and Predicted Eye Movements
Nora Hollenstein | Itziar Gonzalez-Dios | Lisa Beinborn | Lena Jäger
Proceedings of the Workshop on Cognitive Aspects of the Lexicon

It has been shown that multilingual transformer models are able to predict human reading behavior when fine-tuned on small amounts of eye tracking data. As the cumulated prediction results do not provide insights into the linguistic cues that the model acquires to predict reading behavior, we conduct a deeper analysis of the predictions from the perspective of readability. We try to disentangle the three-fold relationship between human eye movements, the capability of language models to predict these eye movement patterns, and sentence-level readability measures for English. We compare a range of model configurations to multiple baselines. We show that the models exhibit difficulties with function words and that pre-training only provides limited advantages for linguistic generalization.

2021

What is on Social Media that is not in WordNet? A Preliminary Analysis on the TwitterAAE Corpus
Cecilia Domingo | Tatiana Gonzalez-Ferrero | Itziar Gonzalez-Dios
Proceedings of the 11th Global Wordnet Conference

Natural Language Processing tools and resources have been so far mainly created and trained for standard varieties of language. Nowadays, with the use of large amounts of data gathered from social media, other varieties and registers need to be processed, which may present other challenges and difficulties. In this work, we focus on English and we present a preliminary analysis by comparing the TwitterAAE corpus, which is annotated for ethnicity, and WordNet by quantifying and explaining the online language that WordNet misses.

A Syntax-Aware Edit-based System for Text Simplification
Oscar M. Cumbicus-Pineda | Itziar Gonzalez-Dios | Aitor Soroa
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Edit-based text simplification systems have attracted much attention in recent years due to their ability to produce simplification solutions that are interpretable, as well as requiring fewer training examples compared to traditional seq2seq systems. Edit-based systems learn edit operations at a word level, but it is well known that many of the operations performed when simplifying text are of a syntactic nature. In this paper we propose to add syntactic information into a well-known edit-based system. We extend the system with a graph convolutional network module that mimics the dependency structure of the sentence, thus giving the model an explicit representation of syntax. We perform a series of experiments in English, Spanish and Italian, and report improvements over the state of the art in four out of five datasets. Further analysis shows that syntactic information is always beneficial, and suggests that syntax is more helpful in complex sentences.

2020

LagunTest: A NLP Based Application to Enhance Reading Comprehension
Kepa Bengoetxea | Itziar Gonzalez-Dios | Amaia Aguirregoitia
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

The ability to read and understand written texts plays an important role in education, above all in the last years of primary education. This is especially pertinent in language immersion educational programmes, where some students have low linguistic competence in the languages of instruction. In this context, adapting the texts to the individual needs of each student requires a considerable effort by education professionals. However, language technologies can facilitate the laborious adaptation of materials in order to enhance reading comprehension. In this paper, we present LagunTest, an NLP-based application that takes as input a text in Basque or English, offers synonyms, definitions and examples of the words in different contexts, and presents some linguistic characteristics as well as visualizations. LagunTest is based on reusable and open multilingual and multimodal tools, and it is also distributed with an open license. LagunTest is intended to ease the burden of education professionals in the task of adapting materials, and the output should always be supervised by them.

Proceedings of the LREC 2020 Workshop on Multimodal Wordnets (MMW2020)
Thierry Declerk | Itziar Gonzalez-Dios | German Rigau
Proceedings of the LREC 2020 Workshop on Multimodal Wordnets (MMW2020)

Towards modelling SUMO attributes through WordNet adjectives: a Case Study on Qualities
Itziar Gonzalez-Dios | Javier Alvez | German Rigau
Proceedings of the LREC 2020 Workshop on Multimodal Wordnets (MMW2020)

Previous studies have shown that the knowledge about attributes and properties in the SUMO ontology and its mapping to WordNet adjectives lacks an accurate and complete characterization. A proper characterization of this type of knowledge is required to perform formal commonsense reasoning based on the SUMO properties, for instance to distinguish one concept from another based on their properties. In this context, we propose a new semi-automatic approach to model the knowledge about properties and attributes in SUMO by exploiting the information encoded in WordNet adjectives and its mapping to SUMO. To that end, we considered clusters of semantically related groups of WordNet adjectival and nominal synsets. Based on these clusters, we propose a new semi-automatic model for SUMO attributes and their mapping to WordNet, which also includes polarity information. In this paper, as an exploratory approach, we focus on qualities.

Exploring the Enrichment of Basque WordNet with a Sentiment Lexicon
Itziar Gonzalez-Dios | Jon Alkorta
Proceedings of the LREC 2020 Workshop on Multimodal Wordnets (MMW2020)

Wordnets are lexical databases where the semantic relations of words and concepts are established. These resources are useful for many NLP tasks, such as automatic text classification, word-sense disambiguation or machine translation. In comparison with other wordnets, the Basque version is smaller and some PoS are underrepresented or missing, e.g. adjectives and adverbs. In this work, we explore a novel approach to enrich the Basque WordNet, focusing on the adjectives. We want to prove the use and effectiveness of sentiment lexicons to enrich the resource without the need of starting from scratch. Using as complementary resources one dictionary and the sentiment valences of the words, we check if the word of the lexicon matches with the meaning of the synset, and if it matches we add the word as variant to the Basque WordNet. Following this methodology, we describe the most frequent adjectives with positive and negative valence, the matches and the possible solutions for the non-matches.

2019

Commonsense Reasoning Using WordNet and SUMO: a Detailed Analysis
Javier Álvez | Itziar Gonzalez-Dios | German Rigau
Proceedings of the 10th Global Wordnet Conference

We describe a detailed analysis of a sample of a large benchmark of commonsense reasoning problems that has been automatically obtained from WordNet, SUMO and their mapping. The objective is to provide a better assessment of the quality of both the benchmark and the involved knowledge resources for advanced commonsense reasoning tasks. By means of this analysis, we are able to detect some knowledge misalignments, mapping errors and lack of knowledge and resources. Our final objective is the extraction of some guidelines towards a better exploitation of this commonsense knowledge framework by the improvement of the included resources.

Textual genre based approach to use WordNet in language-for-specific-purpose classroom as dictionary
Itziar Gonzalez-Dios
Proceedings of the 10th Global Wordnet Conference

When teaching language for specific purposes (LSP), linguistic resources are needed to help students understand and write specialised texts. As building a lexical resource is costly, we explore the use of wordnets to represent the terms that can be found in particular textual domains. In order to gather the terms to be included in wordnets, we propose a textual genre approach, which leads us to introduce a new relation, "term used in", to link all the possible terms/synsets that can appear in a text to the synset of the textual genre. This way, students can use wordnet as a dictionary or thesaurus when writing specialised texts. We explain our approach by means of the logbooks and terms in Basque. A side effect of this work is also enriching the wordnets with new variants and synsets.

2018

Verbal Multiword Expressions in Basque Corpora
Uxoa Iñurrieta | Itziar Aduriz | Ainara Estarrona | Itziar Gonzalez-Dios | Antton Gurrutxaga | Ruben Urizar | Iñaki Alegria
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

This paper presents a Basque corpus where Verbal Multiword Expressions (VMWEs) were annotated following universal guidelines. Information on the annotation is given, and some ideas for discussion upon the guidelines are also proposed. The corpus is useful not only for NLP-related research, but also to draw conclusions on Basque phraseology in comparison with other languages.

Cross-checking WordNet and SUMO Using Meronymy
Javier Álvez | Itziar Gonzalez-Dios | German Rigau
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Framework for the Analysis of Simplified Texts Taking Discourse into Account: the Basque Causal Relations as Case Study
Itziar Gonzalez-Dios | Arantza Diaz de Ilarraza | Mikel Iruskieta
Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms

2016

A Preliminary Study of Statistically Predictive Syntactic Complexity Features and Manual Simplifications in Basque
Itziar Gonzalez-Dios | María Jesús Aranzabe | Arantza Díaz de Ilarraza
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

In this paper, we present a comparative analysis of statistically predictive syntactic features of complexity and the treatment of these features by humans when simplifying texts. To that end, we have used a list of the five most statistically predictive features obtained automatically and the Corpus of Basque Simplified Texts (CBST) to analyse how the syntactic phenomena in these features have been manually simplified. Our aim is to go beyond the descriptions of operations found in the corpus and relate the multidisciplinary findings to understand text complexity from different points of view. We also present some issues that can be important when analysing linguistic complexity.

2014

Making Biographical Data in Wikipedia Readable: A Pattern-based Multilingual Approach
Itziar Gonzalez-Dios | María Jesús Aranzabe | Arantza Díaz de Ilarraza
Proceedings of the Workshop on Automatic Text Simplification - Methods and Applications in the Multilingual Society (ATS-MA 2014)

Simple or Complex? Assessing the readability of Basque Texts
Itziar Gonzalez-Dios | María Jesús Aranzabe | Arantza Díaz de Ilarraza | Haritz Salaberri
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers