Joint Learning from Labeled and Unlabeled Data for Information Retrieval

Bo Li, Ping Cheng, Le Jia


Abstract
Recently, a significant number of studies have focused on neural information retrieval (IR) models. One category of works use unlabeled data to train general word embeddings based on term proximity, which can be integrated into traditional IR models. The other category employs labeled data (e.g. click-through data) to train end-to-end neural IR models consisting of layers for target-specific representation learning. The latter idea accounts better for the IR task and is favored by recent research works, which is the one we will follow in this paper. We hypothesize that general semantics learned from unlabeled data can complement task-specific representation learned from labeled data of limited quality, and that a combination of the two is favorable. To this end, we propose a learning framework which can benefit from both labeled and more abundant unlabeled data for representation learning in the context of IR. Through a joint learning fashion in a single neural framework, the learned representation is optimized to minimize both the supervised loss on query-document matching and the unsupervised loss on text reconstruction. Standard retrieval experiments on TREC collections indicate that the joint learning methodology leads to significant better performance of retrieval over several strong baselines for IR.
Anthology ID:
C18-1025
Volume:
Proceedings of the 27th International Conference on Computational Linguistics
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Venue:
COLING
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
293–302
URL:
https://www.aclweb.org/anthology/C18-1025
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
https://www.aclweb.org/anthology/C18-1025.pdf