Dear colleagues,
Do you care about improving language technologies beyond mainstream languages? Do you wonder how to collect data for low-resource languages, how to create a first translation system for them, and how to then adapt it efficiently to various downstream tasks?
We are pleased to announce an upcoming LREC2026 tutorial:
'Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies.'
This tutorial is aimed at NLP practitioners, researchers, and developers who work in multilingual settings and with low-resource languages and who are interested in building more equitable, inclusive, and socially impactful language technologies.
Tutorial overview
The tutorial covers the full lifecycle of NLP technology development for a language, including:
- Data collection and corpus creation (e.g., web crawling and annotation)
- Parallel sentence mining and machine translation
- Downstream applications such as text classification and multimodal reasoning
- Strategies for addressing data scarcity, cultural variance, and reproducibility
- Fair and community-informed development practices
Who should attend
- Researchers and practitioners in NLP and multilingual technologies
- Corpus builders and linguists working on under-represented languages
- Developers interested in low-resource or inclusive NLP
- Students and early-career researchers
Scope and highlights
- Case studies spanning 10+ languages from diverse language families and geopolitical contexts
- Coverage of both digitally resource-rich and severely under-represented languages
- Emphasis on hands-on methods and applied modeling frameworks
Save the date and place:
Saturday, 16 May 2026, morning session, Room 6
More information:
https://tum-nlp.github.io/low-resource-tutorial/
Keep an eye on our website: we will fully open-source the tutorial materials!
Additionally, we would like to gain an overview of the practices and challenges researchers face when collecting and annotating datasets for under-represented languages.
If you are such a researcher, are working on a very surprising language, or simply have experience to share on the topic, please fill in this form to take part in an interview: https://forms.gle/L81hpvZGfemyMjtX7
Organisers:
- Ekaterina (Katya) Artemova, Toloka.ai
- Laurie Burchell, Common Crawl Foundation
- Daryna Dementieva, Technical University of Munich
- Shu Okabe, Technical University of Munich
- Mariya Shmatova, Toloka.ai
- Pedro Ortiz Suarez, Common Crawl Foundation
See you at LREC!
Best regards,
The Tutorial Organisers