[LREC2026 Tutorial] Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies

Event Notification Type: Other
Location: LREC2026
Date: Saturday, 16 May 2026
Country: Spain
City: Palma

Dear colleagues,

Do you care about improving language technologies beyond mainstream languages? Do you wonder how to collect data for low-resource languages, how to create a first translation system, and how to then adapt it efficiently to various downstream tasks?

We are pleased to announce an upcoming LREC2026 tutorial:
'Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies'.

This tutorial is aimed at NLP practitioners, researchers, and developers who work with multilingual data and low-resource languages and who are interested in building more equitable, inclusive, and socially impactful language technologies.

Tutorial overview
The tutorial covers the full lifecycle of NLP technology development for a language, including:

  • Data collection and corpus creation (e.g., web crawling and annotation)
  • Parallel sentence mining and machine translation
  • Downstream applications such as text classification and multimodal reasoning
  • Strategies for addressing data scarcity, cultural variance, and reproducibility
  • Fair and community-informed development practices

Who should attend

  • Researchers and practitioners in NLP and multilingual technologies
  • Corpus builders and linguists working on under-represented languages
  • Developers interested in low-resource or inclusive NLP
  • Students and early-career researchers

Scope and highlights

  • Case studies spanning 10+ languages from diverse language families and geopolitical contexts
  • Coverage of both digitally resource-rich and severely under-represented languages
  • Emphasis on hands-on methods and applied modeling frameworks

Save the date and place:
Saturday, 16 May 2026, morning session, Room 6

More information:
https://tum-nlp.github.io/low-resource-tutorial/

Stay tuned for our website – we will fully open-source the tutorial materials!

Additionally, we would like to gain an overview of the practices and challenges that researchers face when collecting and annotating datasets for under-represented languages.
If you are such a researcher, work on a particularly unusual language, or simply have experience to share on the topic, please fill in this form to take part in an interview: https://forms.gle/L81hpvZGfemyMjtX7

Organisers:

  • Ekaterina (Katya) Artemova, Toloka.ai
  • Laurie Burchell, Common Crawl Foundation
  • Daryna Dementieva, Technical University of Munich
  • Shu Okabe, Technical University of Munich
  • Mariya Shmatova, Toloka.ai
  • Pedro Ortiz Suarez, Common Crawl Foundation

See you at LREC!

Best regards,
The Tutorial Organisers