LAMBADA test set release

Contact: Denis Paperno, Sandro Pezzelle

We are happy to announce the release of the test portion of the LAMBADA dataset (LAnguage Modeling Broadened to Account for Discourse Aspects). LAMBADA tests the ability of computational models of natural language to integrate information from a context broader than a single sentence or an n-gram window. Current models struggle with discourse context in general, and with the LAMBADA task in particular (as shown in Paperno et al. 2016). By releasing the test set, we hope to encourage research in this area and to help move AI towards real natural language understanding.

IN A NUTSHELL

LAMBADA website: http://clic.cimec.unitn.it/lambada

Reference: D. Paperno, G. Kruszewski, A. Lazaridou, Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda and R. Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. Proceedings of ACL 2016 (54th Annual Meeting of the Association for Computational Linguistics), East Stroudsburg PA: ACL, pages 1525-1534.

DETAILS

LAMBADA is a collection of narrative passages sharing the characteristic that human subjects are able to guess their last word if they are exposed to a long passage, but not if they only see the last sentence preceding the target word. For example, this is a sample data point in the dataset:
Context: "Yes, I thought I was going to lose the baby." "I was scared too," he stated, sincerity flooding his eyes. "You were?" "Yes, of course. Why do you even ask?" "This baby wasn't exactly planned for."
Target sentence: "Do you honestly think that I would want you to have a ________?"
Target word: miscarriage

The LAMBADA task consists of predicting the target word given the whole passage (i.e., the context plus the target sentence). For more information and to download the data, visit the dataset’s website: http://clic.cimec.unitn.it/lambada
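
As an illustration of the task setup described above, here is a minimal Python sketch of an evaluation loop. It is not the official evaluation code. It assumes each passage is a single string whose final whitespace-separated token is the target word (with the closing punctuation already dropped); the names split_passage, accuracy and the predict callback are hypothetical and stand in for whatever model is being tested.

```python
from typing import Callable, Iterable, Tuple


def split_passage(passage: str) -> Tuple[str, str]:
    """Split a passage into (text before the last word, last word).

    Assumption: the target word is the final whitespace-separated token.
    """
    prefix, _, target = passage.strip().rpartition(" ")
    return prefix, target


def accuracy(passages: Iterable[str], predict: Callable[[str], str]) -> float:
    """Fraction of passages whose last word is predicted exactly."""
    total = 0
    correct = 0
    for passage in passages:
        prefix, target = split_passage(passage)
        total += 1
        if predict(prefix) == target:
            correct += 1
    return correct / total if total else 0.0


# The sample data point from this announcement, with the target word
# placed at the end of the passage (assumed layout, see above).
sample = (
    '"Yes, I thought I was going to lose the baby." '
    '"I was scared too," he stated, sincerity flooding his eyes. '
    '"You were?" "Yes, of course. Why do you even ask?" '
    '"This baby wasn\'t exactly planned for." '
    '"Do you honestly think that I would want you to have a miscarriage'
)

if __name__ == "__main__":
    # Toy predictor that always guesses the same word; a real model
    # would condition on the full prefix it receives.
    print(accuracy([sample], lambda prefix: "baby"))
```

Any function that maps the passage prefix to a single word can be passed as predict, so the same loop works for an n-gram baseline or a neural language model.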

Acknowledgements: This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 655577 (LOVe); ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES); NWO VIDI grant n. 276-89-008 (Asymmetry in Conversation). LAMBADA passages were extracted from the BookCorpus (http://www.cs.toronto.edu/~mbweb).