Call for Shared Task Participation: Data Contamination Evidence Collection - CONDA workshop @ ACL 2024

Event Notification Type: 
Call for Participation
Abbreviated Title: 
CONDA2024
Location: 
ACL2024
Country: 
Thailand
City: 
Bangkok
Submission Deadline: 
Monday, 1 July 2024

We invite the community to participate in a shared task organized in the context of the CONDA workshop (https://conda-workshop.github.io/).

Data contamination, where evaluation data is inadvertently included in the pre-training corpora of large-scale models, and language models (LMs) in particular, has become a significant concern in recent times (Sainz et al., 2023; Jacovi et al., 2023). The growing scale of both models and data, coupled with massive web crawling, has led to the inclusion of segments from evaluation benchmarks in the pre-training data of LMs (Dodge et al., 2021; OpenAI, 2023; Google, 2023; Elazar et al., 2023). The scale of internet data makes it difficult to prevent this contamination from happening, or even to detect when it has happened (Bommasani et al., 2022; Mitchell et al., 2023). Crucially, when evaluation data becomes part of the pre-training data, it introduces biases and can artificially inflate the performance of LMs on specific tasks or benchmarks (Magar and Schwartz, 2022). This poses a challenge for fair and unbiased evaluation of models, as their performance may not accurately reflect their generalization capabilities.

The shared task is a community effort on centralized data contamination evidence collection. While the problem of data contamination is prevalent and serious, its breadth and depth are still largely unknown. Concrete evidence of contamination is scattered across papers, blog posts, and social media, and we suspect that the true scope of data contamination in NLP is significantly larger than reported.

With this shared task we aim to provide a structured, centralized platform for contamination evidence collection to help the community understand the extent of the problem and to help researchers avoid repeating the same mistakes. The shared task also gathers evidence of clean, non-contaminated instances. The platform is already available for perusal at https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database.

Participants in the shared task need to submit their contamination evidence (see instructions below). The CONDA 2024 workshop organizers will review the evidence through pull requests.

Compilation Paper

As a companion to the contamination evidence platform, we will produce a paper that provides a summary and overview of the evidence collected in the shared task. Participants who contribute to the shared task will be listed as co-authors of the paper.

Instructions for Evidence Submission

Each submission should report a case of contamination, or the lack thereof. A submission can concern either (1) contamination in a corpus used to pre-train language models, where the pre-training corpus contains a specific evaluation dataset, or (2) contamination in a model that shows evidence of having seen a specific evaluation dataset during training. Each submission needs to name the corpus (or model) and the evaluation dataset, and to provide some evidence of contamination. Alternatively, we also welcome evidence of a lack of contamination.

Reports must be submitted through a Pull Request in the Data Contamination Report space at HuggingFace. The reports must follow the Contribution Guidelines provided in the space and will be reviewed by the organizers. If you have any questions, please contact us at conda-workshop [at] googlegroups.com or open a discussion in the space itself.

URL with contribution guidelines: https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database (“Contribution Guidelines” tab)
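For illustration only, the sketch below summarizes, as a Python dictionary, the pieces of information a single report needs to convey according to the description above. The field names are our own shorthand and not the official template; the authoritative structure is described in the Contribution Guidelines linked above.

    # Hypothetical sketch only: the field names are illustrative, not the official schema.
    report = {
        "contaminated_source": "pre-training corpus or model under scrutiny",
        "evaluation_dataset": "evaluation benchmark suspected to be (or not be) contaminated",
        "contaminated": True,  # False when reporting evidence of a clean, non-contaminated case
        "evidence": "pointer to the evidence, e.g. overlap statistics, a paper, or a blog post",
    }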

Important dates

  • Deadline for evidence submission: July 1, 2024
  • Workshop day: August 16, 2024

Sponsors

  • AWS AI and Amazon Bedrock
  • HuggingFace
  • Google

Contact

conda-workshop [at] googlegroups.com

Organizers
Oscar Sainz, University of the Basque Country (UPV/EHU)
Iker García Ferrero, University of the Basque Country (UPV/EHU)
Eneko Agirre, University of the Basque Country (UPV/EHU)
Jon Ander Campos, Cohere
Alon Jacovi, Bar Ilan University
Yanai Elazar, Allen Institute for Artificial Intelligence and University of Washington
Yoav Goldberg, Bar Ilan University and Allen Institute for Artificial Intelligence