Multimodal Speech Recognition with Unstructured Audio Masking

Tejas Srinivasan, Ramon Sanabria, Florian Metze, Desmond Elliott


Abstract
Visual context has been shown to be useful for automatic speech recognition (ASR) systems when the speech signal is noisy or corrupted. Previous work, however, has only demonstrated the utility of visual context in an unrealistic setting, where a fixed set of words is systematically masked in the audio. In this paper, we simulate a more realistic masking scenario during model training, called RandWordMask, where masking can occur for any word segment. Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting. Moreover, our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted. These results show that multimodal ASR systems can leverage the visual signal in more generalized noisy scenarios.
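To make the RandWordMask idea concrete, the sketch below shows one way unstructured word-level masking could be simulated during training: given word-aligned audio frames, each word span may be replaced with a mask value, so any word (not a fixed set) can be corrupted. This is a minimal illustration, not the authors' implementation; the function name, the masking probability, and the use of zeroed frames as the mask are all assumptions.

```python
import random

def rand_word_mask(features, word_segments, mask_prob=0.3, mask_value=0.0):
    """Illustrative sketch of unstructured audio masking (RandWordMask).

    features      -- list of per-frame feature values (scalars here for brevity)
    word_segments -- list of (start_frame, end_frame) spans, one per word
    mask_prob     -- chance of masking each word span (hypothetical value)
    mask_value    -- value standing in for silence/corruption (assumption)
    """
    masked = list(features)
    for start, end in word_segments:
        if random.random() < mask_prob:
            # Any word's frames may be overwritten, unlike a fixed-word scheme.
            for t in range(start, end):
                masked[t] = mask_value
    return masked

# Example: 10 frames covering three words; some spans may come back zeroed.
frames = [0.5] * 10
segments = [(0, 3), (3, 7), (7, 10)]
print(rand_word_mask(frames, segments))
```

In a real pipeline the frames would be acoustic feature vectors and the word spans would come from a forced alignment; the key property illustrated is that the masked positions vary from example to example.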
Anthology ID:
2020.nlpbt-1.2
Volume:
Proceedings of the First International Workshop on Natural Language Processing Beyond Text
Month:
November
Year:
2020
Address:
Online
Editors:
Giuseppe Castellucci, Simone Filice, Soujanya Poria, Erik Cambria, Lucia Specia
Venue:
nlpbt
Publisher:
Association for Computational Linguistics
Pages:
11–18
URL:
https://aclanthology.org/2020.nlpbt-1.2
DOI:
10.18653/v1/2020.nlpbt-1.2
Cite (ACL):
Tejas Srinivasan, Ramon Sanabria, Florian Metze, and Desmond Elliott. 2020. Multimodal Speech Recognition with Unstructured Audio Masking. In Proceedings of the First International Workshop on Natural Language Processing Beyond Text, pages 11–18, Online. Association for Computational Linguistics.
Cite (Informal):
Multimodal Speech Recognition with Unstructured Audio Masking (Srinivasan et al., nlpbt 2020)
PDF:
https://aclanthology.org/2020.nlpbt-1.2.pdf
Video:
https://slideslive.com/38939780
Data
SPEECH-COCO