Sérgio Matos


2024

Exploring efficient zero-shot synthetic dataset generation for Information Retrieval
Tiago Almeida | Sérgio Matos
Findings of the Association for Computational Linguistics: EACL 2024

The broad integration of neural retrieval models into Information Retrieval (IR) systems is significantly impeded by the high cost and laborious process associated with the manual labelling of training data. Similarly, synthetic training data generation, a potential workaround, often requires expensive computational resources due to the reliance on large language models. This work explores the potential of small language models for efficiently creating high-quality synthetic datasets to train neural retrieval models. We aim to identify an optimal method to generate synthetic datasets, enabling the training of neural reranking models in document collections where annotated data is unavailable. We introduce a novel methodology, grounded in the principles of information theory, to select the most appropriate documents to be used as context for question generation. Then, we employ a small language model for zero-shot conditional question generation, supplemented by a filtering mechanism to ensure the quality of generated questions. Extensive evaluation on five datasets unveils the potential of our approach, outperforming unsupervised retrieval methods such as BM25 and pretrained monoT5. Our findings indicate that an efficiently generated “silver-standard” dataset allows effective training of neural rerankers in unlabeled scenarios. To ensure reproducibility and facilitate wider application, we will release a code repository featuring an accessible API for zero-shot synthetic question generation.

2021

Benchmarking a transformer-FREE model for ad-hoc retrieval
Tiago Almeida | Sérgio Matos
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Transformer-based “behemoths” have grown in popularity, as well as structurally, shattering multiple NLP benchmarks along the way. However, their real-world usability remains a question. In this work, we empirically assess the feasibility of applying transformer-based models in real-world ad-hoc retrieval applications by comparison to a “greener and more sustainable” alternative, comprising only 620 trainable parameters. We present an analysis of their efficacy and efficiency and show that, considering limited computational resources, the lighter model running on the CPU achieves a 3 to 20 times speedup in training and 7 to 47 times in inference while maintaining a comparable retrieval performance. Code to reproduce the efficiency experiments is available at https://github.com/bioinformatics-ua/EACL2021-reproducibility/.

2020

Frugal neural reranking: evaluation on the Covid-19 literature
Tiago Almeida | Sérgio Matos
Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020

The Covid-19 pandemic urged the scientific community to join efforts at an unprecedented scale, leading to faster-than-ever dissemination of data and results, which in turn motivated more research works. This paper presents and discusses information retrieval models aimed at addressing the challenge of searching the large number of publications that stem from these studies. The model presented, based on classical baselines followed by an interaction-based neural ranking model, was evaluated and evolved within the TREC Covid challenge setting. Results on this dataset show that, when starting with a strong baseline, our light neural ranking model can achieve results that are comparable to other model architectures that use a very large number of parameters.

2015

BioinformaticsUA: Machine Learning and Rule-Based Recognition of Disorders and Clinical Attributes from Patient Notes
Sérgio Matos | José Sequeira | José Luís Oliveira
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

BioinformaticsUA: Concept Recognition in Clinical Narratives Using a Modular and Highly Efficient Text Processing Framework
Sérgio Matos | Tiago Nunes | José Luís Oliveira
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)