SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding

Yu-An Chung, Chenguang Zhu, Michael Zeng


Abstract
Spoken language understanding (SLU) requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions. To boost the models’ performance, various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. However, the inherent disparities between the two modalities necessitate a mutual analysis. In this paper, we propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules. Besides conducting a self-supervised masked language modeling task on the two individual modules using unpaired speech and text, SPLAT aligns representations from the two modules in a shared latent space using a small amount of paired speech and text. Thus, during fine-tuning, the speech module alone can produce representations carrying both acoustic information and contextual semantic knowledge of an input acoustic signal. Experimental results verify the effectiveness of our approach on various SLU tasks. For example, SPLAT improves the previous state-of-the-art performance on the Spoken SQuAD dataset by more than 10%.
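The alignment idea described in the abstract — pulling speech and text representations of the same utterance together in a shared latent space — can be illustrated with a minimal sketch. The function below is a hypothetical simplification (not the paper's actual objective): it mean-pools each modality's token representations and penalizes the squared distance between the pooled vectors for a paired speech–text example.

```python
import numpy as np

def alignment_loss(speech_repr: np.ndarray, text_repr: np.ndarray) -> float:
    """Toy sequence-level alignment loss for one paired example.

    speech_repr: (num_speech_frames, dim) representations from the speech module.
    text_repr:   (num_text_tokens, dim) representations from the language module.
    Both are mean-pooled into a single vector per modality; the loss is the
    squared Euclidean distance between the two pooled vectors, so representations
    of paired speech and text are encouraged to coincide in the shared space.
    """
    pooled_speech = speech_repr.mean(axis=0)
    pooled_text = text_repr.mean(axis=0)
    return float(np.sum((pooled_speech - pooled_text) ** 2))
```

In a full pipeline this alignment term would be added to the two modules' masked-language-modeling losses, with the alignment term computed only on the small paired subset of the training data.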
Anthology ID:
2021.naacl-main.152
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
1897–1907
URL:
https://aclanthology.org/2021.naacl-main.152
DOI:
10.18653/v1/2021.naacl-main.152
Cite (ACL):
Yu-An Chung, Chenguang Zhu, and Michael Zeng. 2021. SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1897–1907, Online. Association for Computational Linguistics.
Cite (Informal):
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding (Chung et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.152.pdf
Video:
https://aclanthology.org/2021.naacl-main.152.mp4
Data:
CMU-MOSEI, Fluent Speech Commands, LibriSpeech, SQuAD, Spoken-SQuAD