Vivi Nastase

Also published as: Vivi Năstase


2023

BLM-s/lE: A structured dataset of English spray-load verb alternations for testing generalization in LLMs
Giuseppe Samo | Vivi Nastase | Chunyang Jiang | Paola Merlo
Findings of the Association for Computational Linguistics: EMNLP 2023

Current NLP models appear to be achieving performance comparable to human capabilities on well-established benchmarks. New benchmarks are now necessary to test deeper layers of understanding of natural languages by these models. Blackbird’s Language Matrices are a recently developed framework that draws inspiration from tests of human analytic intelligence. The BLM task has revealed that successful performances in previously studied linguistic problems do not yet stem from a deep understanding of the generative factors that define these problems. In this study, we define a new BLM task for predicate-argument structure, and develop a structured dataset for its investigation, concentrating on the spray-load verb alternations in English, as a case study. The context sentences include one alternant from the spray-load alternation and the target sentence is the other alternant, to be chosen among a minimally contrastive and adversarial set of answers. We describe the generation process of the dataset and the reasoning behind the generating rules. The dataset aims to facilitate investigations into how verb information is encoded in sentence embeddings and how models generalize to the complex properties of argument structures. Benchmarking experiments conducted on the dataset and qualitative error analysis on the answer set reveal the inherent challenges associated with the problem even for current high-performing representations.

BLM-AgrF: A New French Benchmark to Investigate Generalization of Agreement in Neural Networks
Aixiu An | Chunyang Jiang | Maria A. Rodriguez | Vivi Nastase | Paola Merlo
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Successful machine learning systems currently rely on massive amounts of data, which are very effective in hiding some of the shallowness of the learned models. To help train models with more complex and compositional skills, we need challenging data, on which a system is successful only if it detects structure and regularities, that will allow it to generalize. In this paper, we describe a French dataset (BLM-AgrF) for learning the underlying rules of subject-verb agreement in sentences, developed in the BLM framework, a new task inspired by visual IQ tests known as Raven’s Progressive Matrices. In this task, an instance consists of sequences of sentences with specific attributes. To predict the correct answer as the next element of the sequence, a model must correctly detect the generative model used to produce the dataset. We provide details and share a dataset built following this methodology. Two exploratory baselines based on commonly used architectures show that despite the simplicity of the phenomenon, it is a complex problem for deep learning systems.

Grammatical information in BERT sentence embeddings as two-dimensional arrays
Vivi Nastase | Paola Merlo
Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023)

Sentence embeddings induced with various transformer architectures encode much semantic and syntactic information in a distributed manner in a one-dimensional array. We investigate whether specific grammatical information can be accessed in these distributed representations. Using data from a task developed to test rule-like generalizations, our experiments on detecting subject-verb agreement yield several promising results. First, we show that while the usual sentence representations encoded as one-dimensional arrays do not easily support extraction of rule-like regularities, a two-dimensional reshaping of these vectors allows various learning architectures to access such information. Next, we show that various architectures can detect patterns in these two-dimensional reshaped sentence embeddings and successfully learn a model based on smaller amounts of simpler training data, which performs well on more complex test data. This indicates that current sentence embeddings contain information that is regularly distributed, and which can be captured when the embeddings are reshaped into higher dimensional arrays. Our results cast light on representations produced by language models and help move towards developing few-shot learning approaches.

Blackbird Language Matrices Tasks for Generalization
Paola Merlo | Chunyang Jiang | Giuseppe Samo | Vivi Nastase
Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP

To develop a system with near-human language capabilities, we need to understand current systems’ generalisation and compositional abilities. We approach this by generating compositional, structured data, inspired from visual intelligence tests, that depend on the problem-solvers being able to disentangle objects and their absolute and relative properties in a sequence of images. We design an analogous task and develop the corresponding datasets that capture specific linguistic phenomena and their properties. Solving each problem instance depends on detecting the relevant linguistic objects and generative rules of the problem. We propose two datasets modelling two linguistic phenomena – subject-verb agreement in French, and verb alternations in English. The datasets can be used to investigate how LLMs encode linguistic objects, such as phrases, their grammatical and semantic properties, such as number or semantic role, and how such information is combined to correctly solve each problem. Specifically generated error types help investigate the behaviour of the system, which important information it is able to detect, and which structures mislead it.

2022

Proceedings of the 11th Joint Conference on Lexical and Computational Semantics
Vivi Nastase | Ellie Pavlick | Mohammad Taher Pilehvar | Jose Camacho-Collados | Alessandro Raganato
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics

2021

Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics
Lun-Wei Ku | Vivi Nastase | Ivan Vulić
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

2019

Abstract Graphs and Abstract Paths for Knowledge Graph Completion
Vivi Nastase | Bhushan Kotnis
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Knowledge graphs, which provide numerous facts in a machine-friendly format, are incomplete. Information that we induce from such graphs – e.g. entity embeddings, relation representations or patterns – will be affected by the imbalance in the information captured in the graph – by biasing representations, or causing us to miss potential patterns. To partially compensate for this situation we describe a method for representing knowledge graphs that capture an intensional representation of the original extensional information. This representation is very compact, and it abstracts away from individual links, allowing us to find better path candidates, as shown by the results of link prediction using this information.

Towards Extracting Medical Family History from Natural Language Interactions: A New Dataset and Baselines
Mahmoud Azab | Stephane Dadian | Vivi Nastase | Larry An | Rada Mihalcea
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We introduce a new dataset consisting of natural language interactions annotated with medical family histories, obtained during interactions with a genetic counselor and through crowdsourcing, following a questionnaire created by experts in the domain. We describe the data collection process and the annotations performed by medical professionals, including illness and personal attributes (name, age, gender, family relationships) for the patient and their family members. An initial system that performs argument identification and relation extraction shows promising results – average F-score of 0.87 on complex sentences on the targeted relations.

Assessing the Difficulty of Classifying ConceptNet Relations in a Multi-Label Classification Setting
Maria Becker | Michael Staniek | Vivi Nastase | Anette Frank
RELATIONS - Workshop on meaning relations between phrases and sentences

Commonsense knowledge relations are crucial for advanced NLU tasks. We examine the learnability of such relations as represented in ConceptNet, taking into account their specific properties, which can make relation classification difficult: a given concept pair can be linked by multiple relation types, and relations can have multi-word arguments of diverse semantic types. We explore a neural open world multi-label classification approach that focuses on the evaluation of classification accuracy for individual relations. Based on an in-depth study of the specific properties of the ConceptNet resource, we investigate the impact of different relation representations and model variations. Our analysis reveals that the complexity of argument types and relation ambiguity are the most important challenges to address. We design a customized evaluation method to address the incompleteness of the resource that can be expanded in future work.

Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications
Vivi Nastase | Benjamin Roth | Laura Dietz | Andrew McCallum
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

Anglicized Words and Misspelled Cognates in Native Language Identification
Ilia Markov | Vivi Nastase | Carlo Strapparava
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

In this paper, we present experiments that estimate the impact of specific lexical choices of people writing in a second language (L2). In particular, we look at misspelled words that indicate lexical uncertainty on the part of the author, and separate them into three categories: misspelled cognates, “L2-ed” (in our case, anglicized) words, and all other spelling errors. We test the assumption that such errors contain clues about the native language of an essay’s author through the task of native language identification. The results of the experiments show that the information brought by each of these categories is complementary. We also note that while the distribution of such features changes with the proficiency level of the writer, their contribution towards native language identification remains significant at all levels.

Metaphors in Text Simplification: To change or not to change, that is the question
Yulia Clausen | Vivi Nastase
Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications

We present an analysis of metaphors in news text simplification. Using features that capture general and metaphor specific characteristics, we test whether we can automatically identify which metaphors will be changed or preserved, and whether there are features that have different predictive power for metaphors or literal words. The experiments show that the Age of Acquisition is the most distinctive feature for both metaphors and literal words. Features that capture Imageability and Concreteness are useful when used alone, but within the full set of features they lose their impact. Frequency of use seems to be the best feature to differentiate metaphors that should be changed and those to be preserved.

2018

Classifying Semantic Clause Types With Recurrent Neural Networks: Analysis of Attention, Context & Genre Characteristics
Maria Becker | Michael Staniek | Vivi Nastase | Alexis Palmer | Anette Frank
Traitement Automatique des Langues, Volume 59, Numéro 2 : Apprentissage profond pour le traitement automatique des langues [Deep Learning for natural language processing]

Induction of a Large-Scale Knowledge Graph from the Regesta Imperii
Juri Opitz | Leo Born | Vivi Nastase
Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

We induce and visualize a Knowledge Graph over the Regesta Imperii (RI), an important large-scale resource for medieval history research. The RI comprise more than 150,000 digitized abstracts of medieval charters issued by the Roman-German kings and popes distributed over many European locations and a time span of more than 700 years. Our goal is to provide a resource for historians to visualize and query the RI, possibly aiding medieval history research. The resulting medieval graph and visualization tools are shared publicly.

The Role of Emotions in Native Language Identification
Ilia Markov | Vivi Nastase | Carlo Strapparava | Grigori Sidorov
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

We explore the hypothesis that emotion is one of the dimensions of language that surfaces from the native language into a second language. To check the role of emotions in native language identification (NLI), we model emotion information through polarity and emotion load features, and use document representations using these features to classify the native language of the author. The results indicate that emotion is relevant for NLI, even for high proficiency levels and across topics.

Punctuation as Native Language Interference
Ilia Markov | Vivi Nastase | Carlo Strapparava
Proceedings of the 27th International Conference on Computational Linguistics

In this paper, we describe experiments designed to explore and evaluate the impact of punctuation marks on the task of native language identification. Punctuation is specific to each language, and is part of the indicators that overtly represent the manner in which each language organizes and conveys information. Our experiments are organized in various set-ups: the usual multi-class classification for individual languages, also considering classification by language groups, across different proficiency levels, topics and even cross-corpus. The results support our hypothesis that punctuation marks are persistent and robust indicators of the native language of the author, which do not diminish in influence even when a high proficiency level in a non-native language is achieved.

Correction of OCR Word Segmentation Errors in Articles from the ACL Collection through Neural Machine Translation Methods
Vivi Nastase | Julian Hitschler
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

DeModify: A Dataset for Analyzing Contextual Constraints on Modifier Deletion
Vivi Nastase | Devon Fritz | Anette Frank
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Improving Native Language Identification by Using Spelling Errors
Lingzhen Chen | Carlo Strapparava | Vivi Nastase
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this paper, we explore spelling errors as a source of information for detecting the native language of a writer, a previously under-explored area. We note that character n-grams from misspelled words are very indicative of the native language of the author. In combination with other lexical features, spelling error features lead to 1.2% improvement in accuracy on classifying texts in the TOEFL11 corpus by the author’s native language, compared to systems participating in the NLI shared task.

Word Etymology as Native Language Interference
Vivi Nastase | Carlo Strapparava
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We present experiments that show the influence of native language on lexical choice when producing text in another language – in this particular case English. We start from the premise that non-native English speakers will choose lexical items that are close to words in their native language. This leads us to an etymology-based representation of documents written by people whose mother tongue is an Indo-European language. Based on this representation we grow a language family tree, that matches closely the Indo-European language tree.

Classifying Semantic Clause Types: Modeling Context and Genre Characteristics with Recurrent Neural Networks and Attention
Maria Becker | Michael Staniek | Vivi Nastase | Alexis Palmer | Anette Frank
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

Detecting aspectual properties of clauses in the form of situation entity types has been shown to depend on a combination of syntactic-semantic and contextual features. We explore this task in a deep-learning framework, where tuned word representations capture lexical, syntactic and semantic features. We introduce an attention mechanism that pinpoints relevant context not only for the current instance, but also for the larger context. Apart from implicitly capturing task relevant features, the advantage of our neural model is that it avoids the need to reproduce linguistic features for other languages and is thus more easily transferable. We present experiments for English and German that achieve competitive performance. We present a novel take on modeling and exploiting genre information and showcase the adaptation of our system from one language to another.

2015

Learning Semantic Relations from Text
Preslav Nakov | Vivi Nastase | Diarmuid Ó Séaghdha | Stan Szpakowicz
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

Every non-trivial text describes interactions and relations between people, institutions, activities, events and so on. What we know about the world consists in large part of such relations, and that knowledge contributes to the understanding of what texts refer to. Newly found relations can in turn become part of this knowledge that is stored for future use. To grasp a text’s semantic content, an automatic system must be able to recognize relations in texts and reason about them. This may be done by applying and updating previously acquired knowledge. We focus here in particular on semantic relations which describe the interactions among nouns and compact noun phrases, and we present such relations from both a theoretical and a practical perspective. The theoretical exploration sketches the historical path which has brought us to the contemporary view and interpretation of semantic relations. We discuss a wide range of relation inventories proposed by linguists and by language processing people. Such inventories vary by domain, granularity and suitability for downstream applications. On the practical side, we investigate the recognition and acquisition of relations from texts. In a look at supervised learning methods, we present available datasets, the variety of features which can describe relation instances, and learning algorithms found appropriate for the task. Next, we present weakly supervised and unsupervised learning methods of acquiring relations from large corpora with little or no previously annotated data. We show how enduring the bootstrapping algorithm based on seed examples or patterns has proved to be, and how it has been adapted to tackle Web-scale text collections. We also show a few machine learning techniques which can perform fast and reliable relation extraction by taking advantage of data redundancy and variability.

Multi-Level Alignments As An Extensible Representation Basis for Textual Entailment Algorithms
Tae-Gil Noh | Sebastian Padó | Vered Shwartz | Ido Dagan | Vivi Nastase | Kathrin Eichler | Lili Kotlerman | Meni Adler
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics

2014

Mapping WordNet Domains, WordNet Topics and Wikipedia Categories to Generate Multilingual Domain Specific Resources
Spandana Gella | Carlo Strapparava | Vivi Nastase
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present the mapping between WordNet domains and WordNet topics, and the emergent Wikipedia categories. This mapping leads to a coarse alignment between WordNet and Wikipedia, useful for producing domain-specific and multilingual corpora. Multilinguality is achieved through the cross-language links between Wikipedia categories. Research in word-sense disambiguation has shown that within a specific domain, relevant words have restricted senses. The multilingual, and comparable, domain-specific corpora we produce have the potential to enhance research in word-sense disambiguation and terminology extraction in different languages, which could enhance the performance of various NLP tasks.

2013

Proceedings of TextGraphs-8 Graph-based Methods for Natural Language Processing
Zornitsa Kozareva | Irina Matveeva | Gabor Melli | Vivi Nastase
Proceedings of TextGraphs-8 Graph-based Methods for Natural Language Processing

Bridging Languages through Etymology: The case of cross language text categorization
Vivi Nastase | Carlo Strapparava
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

Concept-based Selectional Preferences and Distributional Representations from Wikipedia Articles
Alex Judea | Vivi Nastase | Michael Strube
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes the derivation of distributional semantic representations for open class words relative to a concept inventory, and of concepts relative to open class words through grammatical relations extracted from Wikipedia articles. The concept inventory comes from WikiNet, a large-scale concept network derived from Wikipedia. The distinctive feature of these representations is their relation to a concept network, through which we can compute selectional preferences of open-class words relative to general concepts. The resource thus derived provides a meaning representation that complements the relational representation captured in the concept network. It covers English open-class words, but the concept base is language independent. The resource can be extended to other languages, with the use of language specific dependency parsers. Good results in metonymy resolution show the resource's potential use for NLP applications.

Word Epoch Disambiguation: Finding How Words Change Over Time
Rada Mihalcea | Vivi Nastase
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Local and Global Context for Supervised and Unsupervised Metonymy Resolution
Vivi Nastase | Alex Judea | Katja Markert | Michael Strube
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

WikiNetTK – A Tool Kit for Embedding World Knowledge in NLP Applications
Alex Judea | Vivi Nastase | Michael Strube
Proceedings of the IJCNLP 2011 System Demonstrations

2010

WikiNet: A Very Large Scale Multi-Lingual Concept Network
Vivi Nastase | Michael Strube | Benjamin Boerschinger | Caecilia Zirn | Anas Elghafari
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes a multi-lingual large-scale concept network obtained automatically by mining for concepts and relations and exploiting a variety of sources of knowledge from Wikipedia. Concepts and their lexicalizations are extracted from Wikipedia pages, in particular from article titles, hyperlinks, disambiguation pages and cross-language links. Relations are extracted from the category and page network, from the category names, from infoboxes and the body of the articles. The resulting network has two main components: (i) a central, language independent index of concepts, which serves to keep track of the concepts' lexicalizations both within a language and across languages, and to separate linguistic expressions of concepts from the relations in which they are involved (concepts themselves are represented as numeric IDs); (ii) a large network built on the basis of the relations extracted, represented as relations between concepts (more specifically, the numeric IDs). The various stages of obtaining the network were separately evaluated, and the results show a qualitative resource.

2009

Combining Collocations, Lexical and Encyclopedic Knowledge for Metonymy Resolution
Vivi Nastase | Michael Strube
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

What’s in a name? In some languages, grammatical gender
Vivi Nastase | Marius Popescu
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2008

The Telling Tail: Signals of Success in Electronic Negotiation Texts
Marina Sokolova | Vivi Nastase | Stan Szpakowicz
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

Unsupervised All-words Word Sense Disambiguation with Grammatical Dependencies
Vivi Nastase
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

How to Add a New Language on the NLP Map: Building Resources and Tools for Languages with Scarce Resources
Rada Mihalcea | Vivi Nastase
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

Topic-Driven Multi-Document Summarization with Encyclopedic Knowledge and Spreading Activation
Vivi Nastase
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

Acquiring a Taxonomy from the German Wikipedia
Laura Kassner | Vivi Nastase | Michael Strube
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents the process of acquiring a large, domain independent, taxonomy from the German Wikipedia. We build upon a previously implemented platform that extracts a semantic network and taxonomy from the English version of the Wikipedia. We describe two accomplishments of our work: the semantic network for the German language in which isa links are identified and annotated, and an expansion of the platform for easy adaptation for a new language. We identify the platform’s strengths and shortcomings, which stem from the scarcity of free processing resources for languages other than English. We show that the taxonomy induction process is highly reliable - evaluated against the German version of WordNet, GermaNet, the resource obtained shows an accuracy of 83.34%.

2007

SemEval-2007 Task 04: Classification of Semantic Relations between Nominals
Roxana Girju | Preslav Nakov | Vivi Nastase | Stan Szpakowicz | Peter Turney | Deniz Yuret
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

A Study of Two Graph Algorithms in Topic-driven Summarization
Vivi Nastase | Stan Szpakowicz
Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing

Matching syntactic-semantic graphs for semantic relation assignment
Vivi Nastase | Stan Szpakowicz
Proceedings of TextGraphs: the First Workshop on Graph Based Methods for Natural Language Processing

2004

Finding Semantic Associations on Express Lane
Vivi Năstase | Rada Mihalcea
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

An evaluation exercise for Romanian Word Sense Disambiguation
Rada Mihalcea | Vivi Năstase | Timothy Chklovski | Doina Tătar | Dan Tufiş | Florentina Hristea
Proceedings of SENSEVAL-3, the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text

2002

Letter Level Learning for Language Independent Diacritics Restoration
Rada Mihalcea | Vivi Nastase
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)