Pierre Colombo


2023

pdf bib
Toward Stronger Textual Attack Detectors
Pierre Colombo | Marine Picot | Nathan Noiry | Guillaume Staerman | Pablo Piantanida
Findings of the Association for Computational Linguistics: EMNLP 2023

The landscape of available textual adversarial attacks keeps growing, posing severe threats and raising concerns regarding deep NLP systems integrity. However, the crucial problem of defending against malicious attacks has only drawn few attention in the NLP community. The latter is nonetheless instrumental to develop robust and trustworthy systems. This paper makes two important contributions in this line of search: (i) we introduce LAROUSSE, a new framework to detect textual adversarial attacks and (ii) we introduce STAKEOUT, an extended benchmark composed of nine popular attack methods, three datasets and two pre-trained models. LAROUSSE is ready-to-use in production as it is unsupervised, hyperparameter free and non-differentiable, protecting it against gradient-based methods. Our new benchmark STAKEOUT allows for a robust evaluation framework: we conduct extensive numerical experiments which demonstrate that LAROUSSE outperforms previous methods, and which allows to identify interesting factor of detection rate variations.

pdf bib
Steering Large Language Models for Machine Translation with Finetuning and In-Context Learning
Duarte Alves | Nuno Guerreiro | João Alves | José Pombal | Ricardo Rei | José de Souza | Pierre Colombo | Andre Martins
Findings of the Association for Computational Linguistics: EMNLP 2023

Large language models (LLMs) are a promising avenue for machine translation (MT). However, current LLM-based MT systems are brittle: their effectiveness highly depends on the choice of few-shot examples and they often require extra post-processing due to overgeneration. Alternatives such as finetuning on translation instructions are computationally expensive and may weaken in-context learning capabilities, due to overspecialization. In this paper, we provide a closer look at this problem. We start by showing that adapter-based finetuning with LoRA matches the performance of traditional finetuning while reducing the number of training parameters by a factor of 50. This method also outperforms few-shot prompting and eliminates the need for post-processing or in-context examples. However, we show that finetuning generally degrades few-shot performance, hindering adaptation capabilities. Finally, to obtain the best of both worlds, we propose a simple approach that incorporates few-shot examples during finetuning. Experiments on 10 language pairs show that our proposed approach recovers the original few-shot capabilities while keeping the added benefits of finetuning.

pdf bib
The Glass Ceiling of Automatic Evaluation in Natural Language Generation
Pierre Colombo | Maxime Peyrard | Nathan Noiry | Robert West | Pablo Piantanida
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

pdf bib
A Novel Information Theoretic Objective to Disentangle Representations for Fair Classification
Pierre Colombo | Nathan Noiry | Guillaume Staerman | Pablo Piantanida
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

pdf bib
Transductive Learning for Textual Few-Shot Classification in API-based Embedding Models
Pierre Colombo | Victor Pellegrain | Malik Boudiaf | Myriam Tami | Victor Storchan | Ismail Ayed | Pablo Piantanida
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Proprietary and closed APIs are becoming increasingly common to process natural language, and are impacting the practical applications of natural language processing, including few-shot classification. Few-shot classification involves training a model to perform a new classification task with a handful of labeled data. This paper presents three contributions. First, we introduce a scenario where the embedding of a pre-trained model is served through a gated API with compute-cost and data-privacy constraints. Second, we propose a transductive inference, a learning paradigm that has been overlooked by the NLP community. Transductive inference, unlike traditional inductive learning, leverages the statistics of unlabelled data. We also introduce a new parameter-free transductive regularizer based on the Fisher-Rao loss, which can be used on top of the gated API embeddings. This method fully utilizes unlabelled data, does not share any label with the third-party API provider and could serve as a baseline for future research. Third, we propose an improved experimental setting and compile a benchmark of eight datasets involving multiclass classification in four different languages, with up to 151 classes. We evaluate our methods using eight backbone models, along with an episodic evaluation over 1,000 episodes, which demonstrate the superiority of transductive inference over the standard inductive setting.

pdf bib
RainProof: An Umbrella to Shield Text Generator from Out-Of-Distribution Data
Maxime Darrin | Pablo Piantanida | Pierre Colombo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Implementing effective control mechanisms to ensure the proper functioning and security of deployed NLP models, from translation to chatbots, is essential. A key ingredient to ensure safe system behaviour is Out-Of-Distribution (OOD) detection, which aims to detect whether an input sample is statistically far from the training distribution. Although OOD detection is a widely covered topic in classification tasks, most methods rely on hidden features output by the encoder. In this work, we focus on leveraging soft-probabilities in a black-box framework, i.e. we can access the soft-predictions but not the internal states of the model. Our contributions include: (i) RAINPROOF a Relative informAItioN Projection OOD detection framework; and (ii) a more operational evaluation setting for OOD detection. Surprisingly, we find that OOD detection is not necessarily aligned with task-specific measures. The OOD detector may filter out samples well processed by the model and keep samples that are not, leading to weaker performance. Our results show that RAINPROOF provides OOD detection methods more aligned with task-specific performance metrics than traditional OOD detectors.

pdf bib
Revisiting Instruction Fine-tuned Model Evaluation to Guide Industrial Applications
Manuel Faysse | Gautier Viaud | Céline Hudelot | Pierre Colombo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Instruction Fine-Tuning (IFT) is a powerful paradigm that strengthens the zero-shot capabilities of Large Language Models (LLMs), but in doing so induces new evaluation metric requirements. We show LLM-based metrics to be well adapted to these requirements, and leverage them to conduct an investigation of task-specialization strategies, quantifying the trade-offs that emerge in practical industrial settings. Our findings offer practitioners actionable insights for real-world IFT model deployment.

pdf bib
Optimal Transport for Unsupervised Hallucination Detection in Neural Machine Translation
Nuno M. Guerreiro | Pierre Colombo | Pablo Piantanida | André Martins
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Neural machine translation (NMT) has become the de-facto standard in real-world machine translation applications. However, NMT models can unpredictably produce severely pathological translations, known as hallucinations, that seriously undermine user trust. It becomes thus crucial to implement effective preventive strategies to guarantee their proper functioning. In this paper, we address the problem of hallucination detection in NMT by following a simple intuition: as hallucinations are detached from the source content, they exhibit encoder-decoder attention patterns that are statistically different from those of good quality translations. We frame this problem with an optimal transport formulation and propose a fully unsupervised, plug-in detector that can be used with any attention-based NMT model. Experimental results show that our detector not only outperforms all previous model-based detectors, but is also competitive with detectors that employ external models trained on millions of samples for related tasks such as quality estimation and cross-lingual sentence similarity.

2022

pdf bib
Learning Disentangled Textual Representations via Statistical Measures of Similarity
Pierre Colombo | Guillaume Staerman | Nathan Noiry | Pablo Piantanida
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

When working with textual data, a natural application of disentangled representations is the fair classification where the goal is to make predictions without being biased (or influenced) by sensible attributes that may be present in the data (e.g., age, gender or race). Dominant approaches to disentangle a sensitive attribute from textual representations rely on learning simultaneously a penalization term that involves either an adversary loss (e.g., a discriminator) or an information measure (e.g., mutual information). However, these methods require the training of a deep neural network with several parameter updates for each update of the representation model. As a matter of fact, the resulting nested optimization loop is both times consuming, adding complexity to the optimization dynamic, and requires a fine hyperparameter selection (e.g., learning rates, architecture). In this work, we introduce a family of regularizers for learning disentangled representations that do not require training. These regularizers are based on statistical measures of similarity between the conditional probability distributions with respect to the sensible attributes. Our novel regularizers do not require additional training, are faster and do not involve additional tuning while achieving better results both when combined with pretrained and randomly initialized text encoders.

pdf bib
Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation
Cyril Chhun | Pierre Colombo | Fabian M. Suchanek | Chloé Clavel
Proceedings of the 29th International Conference on Computational Linguistics

Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10 different ASG systems. HANNA allows us to quantitatively evaluate the correlations of 72 automatic metrics with human criteria. Our analysis highlights the weaknesses of current metrics for ASG and allows us to formulate practical recommendations for ASG evaluation.

2021

pdf bib
Beam Search with Bidirectional Strategies for Neural Response Generation
Pierre Colombo | Chloé Clavel | Chouchang Yack | Giovanna Varni
Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021)

pdf bib
Improving Multimodal fusion via Mutual Dependency Maximisation
Pierre Colombo | Emile Chapuis | Matthieu Labeau | Chloé Clavel
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multimodal sentiment analysis is a trending area of research, and multimodal fusion is one of its most active topic. Acknowledging humans communicate through a variety of channels (i.e visual, acoustic, linguistic), multimodal systems aim at integrating different unimodal representations into a synthetic one. So far, a consequent effort has been made on developing complex architectures allowing the fusion of these modalities. However, such systems are mainly trained by minimising simple losses such as L1 or cross-entropy. In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities. We demonstrate that our new penalties lead to a consistent improvement (up to 4.3 on accuracy) across a large variety of state-of-the-art models on two well-known sentiment analysis datasets: CMU-MOSI and CMU-MOSEI. Our method not only achieves a new SOTA on both datasets but also produces representations that are more robust to modality drops. Finally, a by-product of our methods includes a statistical network which can be used to interpret the high dimensional representations learnt by the model.

pdf bib
Code-switched inspired losses for spoken dialog representations
Pierre Colombo | Emile Chapuis | Matthieu Labeau | Chloé Clavel
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Spoken dialogue systems need to be able to handle both multiple languages and multilinguality inside a conversation (e.g in case of code-switching). In this work, we introduce new pretraining losses tailored to learn generic multilingual spoken dialogue representations. The goal of these losses is to expose the model to code-switched language. In order to scale up training, we automatically build a pretraining corpus composed of multilingual conversations in five different languages (French, Italian, English, German and Spanish) from OpenSubtitles, a huge multilingual corpus composed of 24.3G tokens. We test the generic representations on MIAM, a new benchmark composed of five dialogue act corpora on the same aforementioned languages as well as on two novel multilingual tasks (i.e multilingual mask utterance retrieval and multilingual inconsistency identification). Our experiments show that our new losses achieve a better performance in both monolingual and multilingual settings.

pdf bib
Automatic Text Evaluation through the Lens of Wasserstein Barycenters
Pierre Colombo | Guillaume Staerman | Chloé Clavel | Pablo Piantanida
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

A new metric BaryScore to evaluate text generation based on deep contextualized embeddings (e.g., BERT, Roberta, ELMo) is introduced. This metric is motivated by a new framework relying on optimal transport tools, i.e., Wasserstein distance and barycenter. By modelling the layer output of deep contextualized embeddings as a probability distribution rather than by a vector embedding; this framework provides a natural way to aggregate the different outputs through the Wasserstein space topology. In addition, it provides theoretical grounds to our metric and offers an alternative to available solutions (e.g., MoverScore and BertScore). Numerical evaluation is performed on four different tasks: machine translation, summarization, data2text generation and image captioning. Our results show that BaryScore outperforms other BERT based metrics and exhibits more consistent behaviour in particular for text summarization.

pdf bib
A Novel Estimator of Mutual Information for Learning to Disentangle Textual Representations
Pierre Colombo | Pablo Piantanida | Chloé Clavel
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Learning disentangled representations of textual data is essential for many natural language tasks such as fair classification, style transfer and sentence generation, among others. The existent dominant approaches in the context of text data either rely on training an adversary (discriminator) that aims at making attribute values difficult to be inferred from the latent code or rely on minimising variational bounds of the mutual information between latent code and the value attribute. However, the available methods suffer of the impossibility to provide a fine-grained control of the degree (or force) of disentanglement. In contrast to adversarial methods, which are remarkably simple, although the adversary seems to be performing perfectly well during the training phase, after it is completed a fair amount of information about the undesired attribute still remains. This paper introduces a novel variational upper bound to the mutual information between an attribute and the latent code of an encoder. Our bound aims at controlling the approximation error via the Renyi’s divergence, leading to both better disentangled representations and in particular, a precise control of the desirable degree of disentanglement than state-of-the-art methods proposed for textual data. Furthermore, it does not suffer from the degeneracy of other losses in multi-class scenarios. We show the superiority of this method on fair classification and on textual style transfer tasks. Additionally, we provide new insights illustrating various trade-offs in style transfer when attempting to learn disentangled representations and quality of the generated sentence.

2020

pdf bib
The importance of fillers for text representations of speech transcripts
Tanvi Dinkar | Pierre Colombo | Matthieu Labeau | Chloé Clavel
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

While being an essential component of spoken language, fillers (e.g. “um” or “uh”) often remain overlooked in Spoken Language Understanding (SLU) tasks. We explore the possibility of representing them with deep contextualised embeddings, showing improvements on modelling spoken language and two downstream tasks — predicting a speaker’s stance and expressed confidence.

pdf bib
Hierarchical Pre-training for Sequence Labelling in Spoken Dialog
Emile Chapuis | Pierre Colombo | Matteo Manica | Matthieu Labeau | Chloé Clavel
Findings of the Association for Computational Linguistics: EMNLP 2020

Sequence labelling tasks like Dialog Act and Emotion/Sentiment identification are a key component of spoken dialog systems. In this work, we propose a new approach to learn generic representations adapted to spoken dialog, which we evaluate on a new benchmark we call Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE benchmark (SILICONE). SILICONE is model-agnostic and contains 10 different datasets of various sizes. We obtain our representations with a hierarchical encoder based on transformer architectures, for which we extend two well-known pre-training objectives. Pre-training is performed on OpenSubtitles: a large corpus of spoken dialog containing over 2.3 billion of tokens. We demonstrate how hierarchical encoders achieve competitive results with consistently fewer parameters compared to state-of-the-art models and we show their importance for both pre-training and fine-tuning.

2019

pdf bib
Affect-Driven Dialog Generation
Pierre Colombo | Wojciech Witon | Ashutosh Modi | James Kennedy | Mubbasir Kapadia
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

The majority of current systems for end-to-end dialog generation focus on response quality without an explicit control over the affective content of the responses. In this paper, we present an affect-driven dialog system, which generates emotional responses in a controlled manner using a continuous representation of emotions. The system achieves this by modeling emotions at a word and sequence level using: (1) a vector representation of the desired emotion, (2) an affect regularizer, which penalizes neutral words, and (3) an affect sampling method, which forces the neural network to generate diverse words that are emotionally relevant. During inference, we use a re-ranking procedure that aims to extract the most emotionally relevant responses using a human-in-the-loop optimization process. We study the performance of our system in terms of both quantitative (BLEU score and response diversity), and qualitative (emotional appropriateness) measures.

pdf bib
From the Token to the Review: A Hierarchical Multimodal approach to Opinion Mining
Alexandre Garcia | Pierre Colombo | Florence d’Alché-Buc | Slim Essid | Chloé Clavel
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The task of predicting fine grained user opinion based on spontaneous spoken language is a key problem arising in the development of Computational Agents as well as in the development of social network based opinion miners. Unfortunately, gathering reliable data on which a model can be trained is notoriously difficult and existing works rely only on coarsely labeled opinions. In this work we aim at bridging the gap separating fine grained opinion models already developed for written language and coarse grained models developed for spontaneous multimodal opinion mining. We take advantage of the implicit hierarchical structure of opinions to build a joint fine and coarse grained opinion model that exploits different views of the opinion expression. The resulting model shares some properties with attention-based models and is shown to provide competitive results on a recently released multimodal fine grained annotated corpus.

2018

pdf bib
Disney at IEST 2018: Predicting Emotions using an Ensemble
Wojciech Witon | Pierre Colombo | Ashutosh Modi | Mubbasir Kapadia
Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

This paper describes our participating system in the WASSA 2018 shared task on emotion prediction. The task focuses on implicit emotion prediction in a tweet. In this task, keywords corresponding to the six emotion labels used (anger, fear, disgust, joy, sad, and surprise) have been removed from the tweet text, making emotion prediction implicit and the task challenging. We propose a model based on an ensemble of classifiers for prediction. Each classifier uses a sequence of Convolutional Neural Network (CNN) architecture blocks and uses ELMo (Embeddings from Language Model) as an input. Our system achieves a 66.2% F1 score on the test set. The best performing system in the shared task has reported a 71.4% F1 score.