M. Antònia Martí

Also published as: Antonia Martí, M. A. Marti, M. A. Martí, M. Antonia Marti, M. Antonia Martí, M.A. Martí, Maria Antònia Martí, Mª Antònia Martí, Toni Martí

2022

pdf bib abs
Information Theory–based Compositional Distributional Semantics
Enrique Amigó | Alejandro Ariza-Casabona | Victor Fresno | M. Antònia Martí
Computational Linguistics, Volume 48, Issue 4 - December 2022

In the context of text representation, Compositional Distributional Semantics models aim to fuse the Distributional Hypothesis and the Principle of Compositionality. Text embedding is based on co-ocurrence distributions and the representations are in turn combined by compositional functions taking into account the text structure. However, the theoretical basis of compositional functions is still an open issue. In this article we define and study the notion of Information Theory–based Compositional Distributional Semantics (ICDS): (i) We first establish formal properties for embedding, composition, and similarity functions based on Shannon’s Information Theory; (ii) we analyze the existing approaches under this prism, checking whether or not they comply with the established desirable properties; (iii) we propose two parameterizable composition and similarity functions that generalize traditional approaches while fulfilling the formal properties; and finally (iv) we perform an empirical study on several textual similarity datasets that include sentences with a high and low lexical overlap, and on the similarity between words and their description. Our theoretical analysis and empirical results show that fulfilling formal properties affects positively the accuracy of text representation models in terms of correspondence (isometry) between the embedding and meaning spaces.

2020

pdf bib abs
Decomposing and Comparing Meaning Relations: Paraphrasing, Textual Entailment, Contradiction, and Specificity
Venelin Kovatchev | Darina Gold | M. Antonia Marti | Maria Salamo | Torsten Zesch
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present a methodology for decomposing and comparing multiple meaning relations (paraphrasing, textual entailment, contradiction, and specificity). The methodology includes SHARel - a new typology that consists of 26 linguistic and 8 reason-based categories. We use the typology to annotate a corpus of 520 sentence pairs in English and we demonstrate that unlike previous typologies, SHARel can be applied to all relations of interest with a high inter-annotator agreement. We analyze and compare the frequency and distribution of the linguistic and reason-based phenomena involved in paraphrasing, textual entailment, contradiction, and specificity. This comparison allows for a much more in-depth analysis of the workings of the individual relations and the way they interact and compare with each other. We release all resources (typology, annotation guidelines, and annotated corpus) to the community.

2019

pdf bib abs
A Qualitative Evaluation Framework for Paraphrase Identification
Venelin Kovatchev | M. Antonia Marti | Maria Salamo | Javier Beltran
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

In this paper, we present a new approach for the evaluation, error analysis, and interpretation of supervised and unsupervised Paraphrase Identification (PI) systems. Our evaluation framework makes use of a PI corpus annotated with linguistic phenomena to provide a better understanding and interpretation of the performance of various PI systems. Our approach allows for a qualitative evaluation and comparison of the PI models using human interpretable categories. It does not require modification of the training objective of the systems and does not place additional burden on the developers. We replicate several popular supervised and unsupervised PI systems. Using our evaluation framework we show that: 1) Each system performs differently with respect to a set of linguistic phenomena and makes qualitatively different kinds of errors; 2) Some linguistic phenomena are more challenging than others across all systems.

2018

pdf bib abs
WARP-Text: a Web-Based Tool for Annotating Relationships between Pairs of Texts
Venelin Kovatchev | M. Antònia Martí | Maria Salamó
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

We present WARP-Text, an open-source web-based tool for annotating relationships between pairs of texts. WARP-Text supports multi-layer annotation and custom definitions of inter-textual and intra-textual relationships. Annotation can be performed at different granularity levels (such as sentences, phrases, or tokens). WARP-Text has an intuitive user-friendly interface both for project managers and annotators. WARP-Text fills a gap in the currently available NLP toolbox, as open-source alternatives for annotation of pairs of text are not readily available. WARP-Text has already been used in several annotation tasks and can be of interest to the researchers working in the areas of Paraphrasing, Entailment, Simplification, and Summarization, among others.

pdf bib
ETPC - A Paraphrase Identification Corpus Annotated with Extended Paraphrase Typology and Negation
Venelin Kovatchev | M. Antònia Martí | Maria Salamó
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

pdf bib abs
Problematic Cases in the Annotation of Negation in Spanish
Salud María Jiménez-Zafra | Maite Martin | L. Alfonso Ureña-López | Toni Martí | Mariona Taulé
Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics (ExProM)

This paper presents the main sources of disagreement found during the annotation of the Spanish SFU Review Corpus with negation (SFU ReviewSP -NEG). Negation detection is a challenge in most of the task related to NLP, so the availability of corpora annotated with this phenomenon is essential in order to advance in tasks related to this area. A thorough analysis of the problems found during the annotation could help in the study of this phenomenon.

2013

pdf bib
Plagiarism Meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection
Alberto Barrón-Cedeño | Marta Vila | M. Antònia Martí | Paolo Rosso
Computational Linguistics, Volume 39, Issue 4 - December 2013

2012

pdf bib abs
Annotating Near-Identity from Coreference Disagreements
Marta Recasens | M. Antònia Martí | Constantin Orasan
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present an extension of the coreference annotation in the English NP4E and the Catalan AnCora-CA corpora with near-identity relations, which are borderline cases of coreference. The annotated subcorpora have 50K tokens each. Near-identity relations, as presented by Recasens et al. (2010; 2011), build upon the idea that identity is a continuum rather than an either/or relation, thus introducing a middle ground category to explain currently problematic cases. The first annotation effort that we describe shows that it is not possible to annotate near-identity explicitly because subjects are not fully aware of it. Therefore, our second annotation effort used an indirect method, and arrived at near-identity annotations by inference from the disagreements between five annotators who had only a two-alternative choice between coreference and non-coreference. The results show that whereas as little as 2-6% of the relations were explicitly annotated as near-identity in the former effort, up to 12-16% of the relations turned out to be near-identical following the indirect method of the latter effort.

2010

pdf bib abs
A Typology of Near-Identity Relations for Coreference (NIDENT)
Marta Recasens | Eduard Hovy | M. Antònia Martí
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The task of coreference resolution requires people or systems to decide when two referring expressions refer to the 'same' entity or event. In real text, this is often a difficult decision because identity is never adequately defined, leading to contradictory treatment of cases in previous work. This paper introduces the concept of 'near-identity', a middle ground category between identity and non-identity, to handle such cases systematically. We present a typology of Near-Identity Relations (NIDENT) that includes fifteen types―grouped under four main families―that capture a wide range of ways in which (near-)coreference relations hold between discourse entities. We validate the theoretical model by annotating a small sample of real data and showing that inter-annotator agreement is high enough for stability (K=0.58, and up to K=0.65 and K=0.84 when leaving out one and two outliers, respectively). This work enables subsequent creation of the first internally consistent language resource of this type through larger annotation efforts.

2009

pdf bib
SemEval-2010 Task 1: Coreference Resolution in Multiple Languages
Marta Recasens | Toni Martí | Mariona Taulé | Lluís Màrquez | Emili Sapena
Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions (SEW-2009)

2008

pdf bib abs
Arabic WordNet: Semi-automatic Extensions using Bayesian Inference
Horacio Rodríguez | David Farwell | Javi Ferreres | Manuel Bertran | Musa Alkhalifa | M. Antonia Martí
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This presentation focuses on the semi-automatic extension of Arabic WordNet (AWN) using lexical and morphological rules and applying Bayesian inference. We briefly report on the current status of AWN and propose a way of extending its coverage by taking advantage of a limited set of highly productive Arabic morphological rules for deriving a range of semantically related word forms from verb entries. The application of this set of rules, combined with the use of bilingual Arabic-English resources and Princetons WordNet, allows the generation of a graph representing the semantic neighbourhood of the original word. In previous work, a set of associations between the hypothesized Arabic words and English synsets was proposed on the basis of this graph. Here, a novel approach to extending AWN is presented whereby a Bayesian Network is automatically built from the graph and then the net is used as an inferencing mechanism for scoring the set of candidate associations. Both on its own and in combination with the previous technique, this new approach has led to improved results.

pdf bib abs
AnCora: Multilevel Annotated Corpora for Catalan and Spanish
Mariona Taulé | M. Antònia Martí | Marta Recasens
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents AnCora, a multilingual corpus annotated at different linguistic levels consisting of 500,000 words in Catalan (AnCora-Ca) and in Spanish (AnCora-Es). At present AnCora is the largest multilayer annotated corpus of these languages freely available from http://clic.ub.edu/ancora. The two corpora consist mainly of newspaper texts annotated at different levels of linguistic description: morphological (PoS and lemmas), syntactic (constituents and functions), and semantic (argument structures, thematic roles, semantic verb classes, named entities, and WordNet nominal senses). All resulting layers are independent of each other, thus making easier the data management. The annotation was performed manually, semiautomatically, or fully automatically, depending on the encoded linguistic information. The development of these basic resources constituted a primary objective, since there was a lack of such resources for these languages. A second goal was the definition of a consistent methodology that can be followed in further annotations. The current versions of AnCora have been used in several international evaluation competitions

pdf bib abs
AnCora-Verb: A Lexical Resource for the Semantic Annotation of Corpora
Juan Aparicio | Mariona Taulé | M. Antònia Martí
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we present two large-scale verbal lexicons, AnCora-Verb-Ca for Catalan and AnCora-Verb-Es for Spanish, which are the basis for the semantic annotation with arguments and thematic roles of AnCora corpora. In AnCora-Verb lexicons, the mapping between syntactic functions, arguments and thematic roles of each verbal predicate it is established taking into account the verbal semantic class and the diatheses alternations in which the predicate can participate. Each verbal predicate is related to one or more semantic classes basically differentiated according to the four event classes -accomplishments, achievements, states and activities-, and on the diatheses alternations in which a verb can occur. AnCora-Verb-Es contains a total of 1,965 different verbs corresponding to 3,671 senses and AnCora-Verb-Ca contains 2,151 verbs and 4,513 senses. These figures correspond to the total of 500,000 words contained in each corpus, AnCora-Ca and AnCora-Es. The lexicons and the annotated corpora constitute the richest linguistic resources of this kind freely available for Spanish and Catalan. The big amount of linguistic information contained in both resources should be of great interest for computational applications and linguistic studies. Currently, a consulting interface for these lexicons is available at (http://clic.ub.edu/ancora/).

We present a machine translation framework in which the interlingua— Lexical Conceptual Structure (LCS)—is coupled with a definitional component that includes bilingual (EuroWordNet) links between words in the source and target languages. While the links between individual words are language-specific, the LCS is designed to be a language-independent, compositional representation. We take the view that the two types of information—shallower, transfer-like knowledge as well as deeper, compositional knowledge—can be reconciled in interlingual machine translation, the former for overcoming the intractability of LCS-based lexical selec- tion, and the latter for relating the underlying semantics of two words cross-linguistically. We describe the acquisition process for these two information types and present results of hand-verification of the acquired lexicon. Finally, we demonstrate the utility of the two information types in interlingual MT.