Kata Gábor


2023

pdf bib
Comparative Analysis of Anomaly Detection Algorithms in Text Data
Yizhou Xu | Kata Gábor | Jérôme Milleret | Frédérique Segond
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Text anomaly detection (TAD) is a crucial task that aims to identify texts that deviate significantly from the norm within a corpus. Despite its importance in various domains, TAD remains relatively underexplored in natural language processing. This article presents a systematic evaluation of 22 TAD algorithms on 17 corpora using multiple text representations, including monolingual and multilingual SBERT. The performance of the algorithms is compared based on three criteria: degree of supervision, theoretical basis, and architecture used. The results demonstrate that semi-supervised methods utilizing weak labels outperform both unsupervised methods and semi-supervised methods using only negative samples for training. Additionally, we explore the application of TAD techniques in hate speech detection. The results provide valuable insights for future TAD research and guide the selection of suitable algorithms for detecting text anomalies in different contexts.

pdf bib
Syntax and Geometry of Information
Raphaël Bailly | Laurent Leblond | Kata Gábor
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper presents an information-theoretical model of syntactic generalization. We study syntactic generalization from the perspective of the capacity to disentangle semantic and structural information, emulating the human capacity to assign a grammaticality judgment to semantically nonsensical sentences. In order to isolate the structure, we propose to represent the probability distribution behind a corpus as the product of the probability of a semantic context and the probability of a structure, the latter being independent of the former. We further elaborate the notion of abstraction as a relaxation of the property of independence. It is based on the measure of structural and contextual information for a given representation. We test abstraction as an optimization objective on the task of inducing syntactic categories from natural language data and show that it significantly outperforms alternative methods. Furthermore, we find that when syntax-unaware optimization objectives succeed in the task, their success is mainly due to an implicit disentanglement process rather than to the model structure. On the other hand, syntactic categories can be deduced in a principled way from the independence between structure and context.

2022

pdf bib
Détection d’anomalies textuelles à base de l’ingénierie d’invite (Prompt Engineering-Based Text Anomaly Detection )
Yizhou Xu | Kata Gábor | Leila Khouas | Frédérique Segond
Actes de la 29e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

La détection d’anomalies textuelles est une tâche importante de la fouille de textes. Plusieurs approches générales, visant l’identification de points de données aberrants, ont été appliqués dans ce domaine. Néanmoins, ces approches exploitent peu les nouvelles avancées du traitement automatique des langues naturelles (TALN). L’avènement des modèles de langage pré-entraînés comme BERT et GPT-2 a donné naissance à un nouveau paradigme de l’apprentissage automatique appelé ingénierie d’invite (prompt engineering) qui a montré de bonnes performances sur plusieurs tâches du TALN. Cet article présente un travail exploratoire visant à examiner la possibilité de détecter des anomalies textuelles à l’aide de l’ingénierie d’invite. Dans nos expérimentations, nous avons examiné la performance de différents modèles d’invite. Les résultats ont montré que l’ingénierie d’invite est une méthode prometteuse pour la détection d’anomalies textuelles.

2020

pdf bib
Emergence of Syntax Needs Minimal Supervision
Raphaël Bailly | Kata Gábor
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper is a theoretical contribution to the debate on the learnability of syntax from a corpus without explicit syntax-specific guidance. Our approach originates in the observable structure of a corpus, which we use to define and isolate grammaticality (syntactic information) and meaning/pragmatics information. We describe the formal characteristics of an autonomous syntax and show that it becomes possible to search for syntax-based lexical categories with a simple optimization process, without any prior hypothesis on the form of the model.

2018

pdf bib
SemEval-2018 Task 7: Semantic Relation Extraction and Classification in Scientific Papers
Kata Gábor | Davide Buscaldi | Anne-Kathrin Schumann | Behrang QasemiZadeh | Haïfa Zargayouna | Thierry Charnois
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes the first task on semantic relation extraction and classification in scientific paper abstracts at SemEval 2018. The challenge focuses on domain-specific semantic relations and includes three different subtasks. The subtasks were designed so as to compare and quantify the effect of different pre-processing steps on the relation classification results. We expect the task to be relevant for a broad range of researchers working on extracting specialized knowledge from domain corpora, for example but not limited to scientific or bio-medical information extraction. The task attracted a total of 32 participants, with 158 submissions across different scenarios.

pdf bib
Apport des dépendances syntaxiques et des patrons séquentiels à l’extraction de relations ()
Kata Gábor | Nadège Lechevrel | Isabelle Tellier | Davide Buscaldi | Haifa Zargayouna | Thierry Charnois
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

2017

pdf bib
Exploring Vector Spaces for Semantic Relations
Kata Gábor | Haïfa Zargayouna | Isabelle Tellier | Davide Buscaldi | Thierry Charnois
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Word embeddings are used with success for a variety of tasks involving lexical semantic similarities between individual words. Using unsupervised methods and just cosine similarity, encouraging results were obtained for analogical similarities. In this paper, we explore the potential of pre-trained word embeddings to identify generic types of semantic relations in an unsupervised experiment. We propose a new relational similarity measure based on the combination of word2vec’s CBOW input and output vectors which outperforms concurrent vector representations, when used for unsupervised clustering on SemEval 2010 Relation Classification data.

2016

pdf bib
Semantic Annotation of the ACL Anthology Corpus for the Automatic Analysis of Scientific Literature
Kata Gábor | Haïfa Zargayouna | Davide Buscaldi | Isabelle Tellier | Thierry Charnois
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper describes the process of creating a corpus annotated for concepts and semantic relations in the scientific domain. A part of the ACL Anthology Corpus was selected for annotation, but the annotation process itself is not specific to the computational linguistics domain and could be applied to any scientific corpora. Concepts were identified and annotated fully automatically, based on a combination of terminology extraction and available ontological resources. A typology of semantic relations between concepts is also proposed. This typology, consisting of 18 domain-specific and 3 generic relations, is the result of a corpus-based investigation of the text sequences occurring between concepts in sentences. A sample of 500 abstracts from the corpus is currently being manually annotated with these semantic relations. Only explicit relations are taken into account, so that the data could serve to train or evaluate pattern-based semantic relation classification systems.

pdf bib
Détection et classification non supervisées de relations sémantiques dans des articles scientifiques (Unsupervised Classification of Semantic Relations in Scientific Papers)
Kata Gábor | Isabelle Tellier | Thierry Charnois | Haïfa Zargayouna | Davide Buscaldi
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Articles longs)

Dans cet article, nous abordons une tâche encore peu explorée, consistant à extraire automatiquement l’état de l’art d’un domaine scientifique à partir de l’analyse d’articles de ce domaine. Nous la ramenons à deux sous-tâches élémentaires : l’identification de concepts et la reconnaissance de relations entre ces concepts. Une extraction terminologique permet d’identifier les concepts candidats, qui sont ensuite alignés à des ressources externes. Dans un deuxième temps, nous cherchons à reconnaître et classifier automatiquement les relations sémantiques entre concepts de manière nonsupervisée, en nous appuyant sur différentes techniques de clustering et de biclustering. Nous mettons en œuvre ces deux étapes dans un corpus extrait de l’archive de l’ACL Anthology. Une analyse manuelle nous a permis de proposer une typologie des relations sémantiques, et de classifier un échantillon d’instances de relations. Les premières évaluations suggèrent l’intérêt du biclustering pour détecter de nouveaux types de relations dans le corpus.

2014

pdf bib
Named Entity Recognition and Correction in OCRized Corpora (Détection et correction automatique d’entités nommées dans des corpus OCRisés) [in French]
Benoît Sagot | Kata Gábor
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib
Automated Error Detection in Digitized Cultural Heritage Documents
Kata Gábor | Benoît Sagot
Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

pdf bib
The WoDiS System - WOlf and DIStributions for Lexical Substitution (Le système WoDiS - WOLF et DIStributions pour la substitution lexicale) [in French]
Kata Gábor
TALN-RECITAL 2014 Workshop SemDis 2014 : Enjeux actuels de la sémantique distributionnelle (SemDis 2014: Current Challenges in Distributional Semantics)

2012

pdf bib
Boosting the Coverage of a Semantic Lexicon by Automatically Extracted Event Nominalizations
Kata Gábor | Marianna Apidianaki | Benoît Sagot | Éric Villemonte de La Clergerie
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this article, we present a distributional analysis method for extracting nominalization relations from monolingual corpora. The acquisition method makes use of distributional and morphological information to select nominalization candidates. We explain how the learning is performed on a dependency annotated corpus and describe the nominalization results. Furthermore, we show how these results served to enrich an existing lexical resource, the WOLF (Wordnet Libre du Franc¸ais). We present the techniques that we developed in order to integrate the new information into WOLF, based on both its structure and content. Finally, we evaluate the validity of the automatically obtained information and the correctness of its integration into the semantic resource. The method proved to be useful for boosting the coverage of WOLF and presents the advantage of filling verbal synsets, which are particularly difficult to handle due to the high level of verbal polysemy.

2010

pdf bib
Acquisition de connaissances lexicales à partir de corpus : la sous-catégorisation verbale en français [Lexical acquisition from corpora: the case of subcategorization frames in French]
Cédric Messiant | Kata Gábor | Thierry Poibeau
Traitement Automatique des Langues, Volume 51, Numéro 1 : Varia [Varia]

2007

pdf bib
Clustering Hungarian Verbs on the Basis of Complementation Patterns
Kata Gábor | Enikő Héja
Proceedings of the ACL 2007 Student Research Workshop