Multimodal Speech Recognition with Unstructured Audio Masking

Visual context has been shown to be useful for automatic speech recognition (ASR) systems when the speech signal is noisy or corrupted. Previous work, however, has only demonstrated the utility of visual context in an unrealistic setting, where a fixed set of words is systematically masked in the audio. In this paper, we simulate a more realistic masking scenario during model training, called RandWordMask, where masking can occur for any word segment. Our experiments on the Flickr 8K Audio Captions Corpus show that multimodal ASR can generalize to recover different types of masked words in this unstructured masking setting. Moreover, our analysis shows that our models are capable of attending to the visual signal when the audio signal is corrupted. These results show that multimodal ASR systems can leverage the visual signal in more generalized noisy scenarios.


Introduction
Jointly modelling linguistic and visual signals is beneficial for several language processing tasks, such as machine translation (Sulubacak et al., 2019), visual question-answering (VQA) (Antol et al., 2015), summarization, and automatic speech recognition (ASR). However, it is unclear exactly how the visual signals are useful for these tasks. For example, in VQA, it has been observed that models can ignore the visual context and instead rely on linguistic biases in the dataset (Ramakrishnan et al., 2018; Grand and Belinkov, 2019); in machine translation, it has been shown that some models are not affected by incorrect visual signals (Elliott, 2018); and in multimodal ASR, the visual signals were shown to act as a regularizer instead of useful disambiguating context. Given these uncertainties, there is a need to clarify the circumstances in which visual signals are useful.
Previous work in multimodal machine translation and ASR (Srinivasan et al., 2020) shows that the visual signal is useful when the linguistic signal is degraded by dropping part of the input. In this setting, multimodal models leverage the visual signals to recover the missing language information. The results of Srinivasan et al. (2020) are a promising start towards verifiably useful multimodality for robust speech recognition. However, those experiments were conducted with structured noise that focused on a predetermined set of groundable entities (i.e., nouns and places). In real-world scenarios, noise occurs in a more unstructured manner. Therefore, it is important that multimodal models can use the visual signal in a wider variety of situations.
In this work, we study multimodal ASR in more realistic noisy scenarios. We follow the methodology of Srinivasan et al. (2020), but mask words in an unstructured manner in the audio signal (we refer to this as RandWordMask). This is in contrast to the structured masking in Srinivasan et al. (2020), where the masked audio corresponds only to entities (which we refer to as EntityMask). The example in Figure 1 shows that RandWordMask can mask any words in the audio signal, whereas EntityMask would only mask entities like "girl" and "slide". We apply masking both during training and testing.

Figure 2: Our unimodal ASR model, along with several of our fusion methods for integrating a visual context vector (in blue) into the ASR model. The two fusion methods not displayed, Weighted-DF and Middle-DF, are constructed similarly to Early-DF and HierAttn-DF, respectively.
The main contributions of this work are:

• We simulate a more realistic masking scenario, called RandWordMask, during training and testing of our ASR models (Section 2).

• We propose several multimodal models (Section 2.2), and show that training with RandWordMask improves their ability to recover masked words (Section 4).

• We show that our multimodal ASR models are right for the right reasons through several quantitative analyses (Sections 4.1, 4.2, 4.4).
The results show that visual signals improve speech recognition in this more difficult, unstructured setting where random words are masked. (We note that RandWordMask is different from robust ASR (Barker et al., 2018) scenarios, where the whole signal is corrupted with stationary noise.) Our models are not only able to recover masked entities, but they also recover words from other syntactic categories, e.g., adjectives, cardinals, and verbs. Furthermore, our analysis shows that models trained with RandWordMask attend to the visual signal when the audio signal is unavailable. This confirms that the visual context can be leveraged when the primary audio signal is masked.

Methodology
In this section, we describe the different ASR models and our technique for simulating unstructured audio masking.

Unimodal ASR Model
Our unimodal ASR model is a word-level sequence-to-sequence model with attention (Bahdanau et al., 2016; Chan et al., 2016), identical to the model used by Srinivasan et al. (2020). The encoder (E) consists of 6 bidirectional LSTM layers (Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997) with temporal sub-sampling (Chan et al., 2016) in the middle two layers. The decoder is a two-layer conditional gated recurrent unit (Cho et al., 2014) which computes attention over the encoder states E.
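To make the temporal sub-sampling concrete, the sketch below shows one common way a pyramidal encoder halves the time dimension between layers, by concatenating adjacent frame pairs. This is an illustrative NumPy sketch, not the paper's actual nmtpytorch implementation, and the merge-by-concatenation choice is an assumption (frame skipping is another common variant).

```python
import numpy as np

def temporal_subsample(frames, factor=2):
    """Halve the sequence length between encoder layers by concatenating
    each pair of adjacent frames (Listen-Attend-Spell style sub-sampling).
    frames: (T, d) array of acoustic features."""
    T, d = frames.shape
    T = T - (T % factor)                       # trim to a multiple of factor
    return frames[:T].reshape(T // factor, d * factor)

# A 100-frame utterance with 43-dim filter bank features (Section 3.1)
x = np.random.randn(100, 43)
h = temporal_subsample(x)                      # shorter, wider sequence
```

Applying this in the two middle encoder layers reduces the sequence length by 4x overall, which shortens the attention span the decoder must cover.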

Multimodal ASR Models
We explore several fusion methods to integrate a visual feature vector v into the unimodal ASR model. Encoder Feature Fusion: We use a visual adaptation method similar to prior work, which we call Shift Adaptation. The visual feature vector v is projected down to the speech feature dimension; the resulting "shift vector" s is then added to the input speech features at all timesteps.
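The Shift Adaptation computation can be sketched in a few lines. This is a minimal NumPy illustration of the described operation; the projection matrix `W_shift` stands in for a parameter that would be learned jointly with the model, and the dimensions follow Sections 3.1 and 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)
speech_dim, visual_dim = 43, 2048
# Stand-in for a learned projection from visual to speech feature space
W_shift = rng.standard_normal((visual_dim, speech_dim)) * 0.01

def shift_adapt(speech_feats, v):
    """Project v down to the speech feature dimension and add the
    resulting 'shift vector' s to the features at every timestep."""
    s = v @ W_shift                  # (speech_dim,)
    return speech_feats + s          # broadcasts over the time axis

x = rng.standard_normal((100, speech_dim))   # 100 frames of filter banks
v = rng.standard_normal(visual_dim)          # pooled CNN features
x_shifted = shift_adapt(x, v)
```

Because the same shift is added at every timestep, the visual signal acts as a global bias on the acoustic input rather than a time-varying one.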
Decoder Feature Fusion: Instead of integrating the visual features into the encoder, we can integrate them in the decoder. We hypothesize that this will bias the ASR's language modelling capacity. Anastasopoulos et al. (2019) explore several strategies for incorporating visual features into an LSTM language model. We employ similar fusion methods in our decoder.

1. Early Decoder Fusion (Early-DF): At each timestep, we concatenate v to the input embedding y_t, which is then projected down to the embedding dimension.

2. Weighted Early Decoder Fusion (Weighted-DF): We calculate a timestep-dependent weighting scalar between the input embedding y_t and the embedded visual features v (Eqn. 7), which scales the contribution of the visual features in the concatenated input (Eqn. 8).

3. Middle Decoder Fusion (Middle-DF): Fusion occurs between the GRU layers at z_t (Eqn. 2), which is the input to the second decoder layer.

4. Hierarchical Attention over Features (HierAttn-DF): We add a hierarchical attention layer (Libovický and Helcl, 2017) that attends between the encoder context vector z_t (Eqn. 2) and the visual feature vector v. The hierarchical context vector z_t^hier is the input to the second decoder layer (Eqn. 3). By conditioning the hierarchical attention on the output of the first decoder layer, the attention layer learns to decide which of the audio and visual modalities is more important for decoding at a given timestep.
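The hierarchical attention step can be sketched as follows. This is an illustrative NumPy version of a two-candidate attention in the spirit of Libovický and Helcl (2017), not the paper's exact parameterization: the projection matrices `W_a`, `W_v`, the scoring vector `w_s`, and all dimensions are assumptions for the sake of a runnable example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                                        # shared projection dimension
W_a = rng.standard_normal((512, d)) * 0.05     # projects audio context z_t
W_v = rng.standard_normal((2048, d)) * 0.05    # projects visual features v
w_s = rng.standard_normal(d) * 0.05            # scoring vector

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hier_attn(z_t, v, query):
    """Attend over two 'modality' candidates: the audio context z_t and
    the visual vector v. The query (output of the first decoder layer)
    scores each candidate; the result is their weighted combination."""
    cand = np.stack([z_t @ W_a, v @ W_v])      # (2, d) projected modalities
    scores = np.tanh(cand + query) @ w_s       # (2,) one score per modality
    alpha = softmax(scores)                    # audio vs. visual weight
    return alpha @ cand, alpha                 # z_t^hier and the weights

z_t = rng.standard_normal(512)    # encoder context vector
v = rng.standard_normal(2048)     # pooled CNN features
q = rng.standard_normal(d)        # first decoder layer output (the query)
z_hier, alpha = hier_attn(z_t, v, q)
```

The two entries of `alpha` are exactly the per-timestep audio/visual weights that the Grounding Rate analysis (Section 3.5) later inspects.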

Unstructured Masked Audio: RandWordMask
We simulate a degradation of the audio signal by randomly masking words in the audio with silence. This approach differs from that of Srinivasan et al. (2020), who masked a fixed set of words corresponding to entities, i.e., nouns and places. Figure 1 shows an example of an audio spectrogram with RandWordMask. The intuition behind random word masking, as opposed to entity-based word masking, is that noise in the audio signal is unlikely to occur systematically when someone is speaking about an entity. Our multimodal ASR models need to be responsive to audio that drops out beyond systematically expected regions.
In real-world settings, the rate at which the speech is masked (unavailable) is highly variable. Therefore, we train the models with an augmented version of the dataset: for each audio utterance, we create four masked audio samples, where words are masked with 0%, 20%, 40% and 60% probability. Note that the text transcript (y_1...N) and image modality (v) remain intact. This approach to augmenting the dataset will result in models that can adapt to different amounts of corruption in the audio signal during evaluation.
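The augmentation scheme above can be sketched in pure Python. Here the masked audio segment is represented symbolically with a `<sil>` token rather than an actual waveform edit; the token name and the per-word independent-Bernoulli masking are illustrative assumptions consistent with the description.

```python
import random

def mask_words(words, p, rng=random):
    """Independently mask each word's audio segment with probability p
    (represented here by a <sil> placeholder token)."""
    return ["<sil>" if rng.random() < p else w for w in words]

def augment(words):
    """RandWordMask augmentation: four copies of each utterance,
    masked at 0%, 20%, 40% and 60% word-masking probability."""
    return [mask_words(words, p) for p in (0.0, 0.2, 0.4, 0.6)]

samples = augment("a girl goes down a slide".split())
```

The 0%-masked copy keeps clean utterances in the training mix, so the model does not over-rely on the visual signal when the audio is intact.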

Dataset
We perform experiments on the Flickr 8K Audio Caption Corpus (Harwath and Glass, 2015), which contains 40,000 spoken captions (total 65 hours of speech) corresponding to 8,000 natural images from the Flickr8K dataset (Hodosh et al., 2015). The augmented dataset that we use for training and testing (as described in Section 2.3) consists of 160,000 spoken captions.
In addition, we use the SpeechCOCO dataset (Havard et al., 2017) for pretraining. SpeechCOCO contains over 600 hours of synthesised speech paired with images.

Audio Features
We extract 43-dimensional filter bank features in a manner identical to Srinivasan et al. (2020). In order to mask the audio, we first extract word-audio alignments from a pre-trained GMM-HMM model and expand the start and end timing marks by 25% of the segment duration to account for misalignments. We mask words in the audio by replacing word segments with 0.5 seconds of silence.
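The two masking steps, widening the forced-alignment boundaries and splicing in silence, can be sketched as below. This is an illustrative sketch on a plain Python list standing in for a waveform; the 16 kHz sample rate and the symmetric 25%-per-side expansion are assumptions (the paper does not specify how the 25% is distributed around the segment).

```python
def expand_segment(start, end, pad_frac=0.25):
    """Widen a word's (start, end) alignment by a fraction of its
    duration on each side, to absorb forced-alignment errors."""
    dur = end - start
    return start - pad_frac * dur, end + pad_frac * dur

def mask_with_silence(signal, start_s, end_s, sr=16000, sil_s=0.5):
    """Cut the word segment out of the waveform and splice in
    0.5 s of silence (zeros) in its place."""
    a, b = int(start_s * sr), int(end_s * sr)
    return signal[:a] + [0.0] * int(sil_s * sr) + signal[b:]

seg = expand_segment(1.0, 1.5)           # word aligned to 1.0-1.5 s
wave = [0.1] * (2 * 16000)               # dummy 2-second waveform
masked = mask_with_silence(wave, *seg)
```

Note that replacing a variable-length segment with a fixed 0.5 s of silence changes the utterance duration, which the sequence model tolerates since it makes no fixed-length assumption.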

Visual Features
We extract visual features from a ResNet-50 CNN (He et al., 2016) pre-trained on ImageNet. Specifically, we extract features from the 2048-dim average pooling layer, and project these to 256-dim through a learned linear layer: v = W · CNN(img).
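The projection v = W · CNN(img) amounts to a single matrix multiply. In this sketch, `W` is a random stand-in for the linear layer that would be learned with the rest of the model, and the pooled CNN output is a random vector in place of real ResNet-50 features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the learned 2048 -> 256 linear projection
W = rng.standard_normal((256, 2048)) * 0.01

def visual_features(pooled_cnn):
    """v = W . CNN(img): map the 2048-dim ResNet-50 average-pool
    features down to the model's 256-dim visual vector."""
    return W @ pooled_cnn

v = visual_features(rng.standard_normal(2048))
```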

Model Implementation
We use the same model hyperparameters as Srinivasan et al. (2020). Models are trained using the nmtpytorch framework (Caglayan et al., 2017). We first pre-train our models for 25,000 minibatches on the SpeechCOCO dataset. This pretraining step, inspired by Ilharco et al. (2019), was crucial to ensure stable training of our models on the Flickr 8K dataset.

Evaluation Metrics
Our model evaluation (Table 1a) is conducted on the development set of Flickr8k-Audio, while the rest of our analysis is conducted on the test set. We report WER for all our models. For datasets where words have been masked in the audio signal, we compute the Recovery Rate (RR; Srinivasan et al., 2020), which measures the percentage of masked words that have been correctly recovered in the transcription.
In addition, we can determine the contribution of the visual signal when decoding each word in the HierAttn-DF model. We do this by inspecting the weights of the audio and visual modalities in the hierarchical attention mechanism. We introduce a new metric to quantify this: Grounding Rate (G.R.).
G.R. = (# recovered words with visual attention > 0.5) / (# correctly recovered masked words)

We choose 0.5 as the threshold since, above this value, more attention is given to the visual modality than to the audio. G.R. thus represents the percentage of recovered words where the model was focusing more on the visual context while decoding.
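The Grounding Rate computation can be sketched directly from its definition. This is an illustrative implementation: the per-word boolean recovery flags and visual attention weights are assumed inputs that would come from the decoder's hierarchical attention (one weight per masked-word timestep).

```python
def grounding_rate(recovered, visual_attn, threshold=0.5):
    """G.R. (%): among correctly recovered masked words, the fraction
    whose hierarchical attention gave more weight to the visual modality.
    `recovered` and `visual_attn` are aligned per masked word."""
    hits = [a for r, a in zip(recovered, visual_attn) if r]
    if not hits:
        return 0.0
    return 100.0 * sum(a > threshold for a in hits) / len(hits)

# 3 of 4 masked words recovered; 2 of those had visual attention > 0.5
gr = grounding_rate([True, True, False, True], [0.8, 0.3, 0.9, 0.6])
```

Note that words which were masked but not recovered (here the third word) do not enter the denominator, so G.R. is conditioned on successful recovery.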

Results and Analysis
In Table 1a, we summarize the performance of our unimodal ASR and proposed multimodal ASR models. Our development set is constructed similarly to the training set described in Section 2.3, consisting of samples with 0%, 20%, 40% and 60% of words masked. We examine performance on this Augmented dataset, as well as on datasets at each individual masking level. We see that the Decoder-Fusion (DF) multimodal models outperform unimodal ASR on both WER and RR. However, the best-performing models on the two metrics differ: Weighted-DF achieves the lowest WER, with an improvement of 1.40% on the augmented dataset, while HierAttn-DF has the best Recovery Rate, with an absolute improvement of 4% over the Unimodal model. These trends hold across all masking levels. Moreover, we observe that as the amount of masking in the audio signal increases, the WER and RR gains of our models increase. The ShiftAdapt model, which integrates the visual features with the speech encoder input, does not show any improvements over unimodal ASR. We observe that ShiftAdapt shows improvements when trained and tested on clean data, which aligns with the regularization effect observed in previous work.
The results in Table 1a show that multimodality can recover words which were masked in an unstructured manner. We now turn our attention to analysing which types of words are recovered better. We conduct this analysis across seven categories: five syntactic (nouns, verbs, adjectives, adverbs and cardinals) and two semantic (places and colors). For each category, we create a new test set where we mask all word occurrences. We note that these categories are "groundable" to varying degrees, where we define groundability as how easily identifiable words are in the visual modality: the more groundable a category, the easier it is to identify words belonging to that category in the visual context. Nouns and places are the most groundable categories, while adjectives and colors are also frequently easy to identify in the image. Verbs and adverbs, however, are less groundable categories.
In Table 1b, we compare the Recovery Rates of the unimodal ASR and HierAttn-DF (the best multimodal model in terms of RR) on the different word types. We observe that on the groundable entities, i.e., nouns and places, there is a relative improvement of at least 25% compared to the Unimodal model. Adjectives and colors, which are also groundable in the visual modality, are recovered around 14% better than by the Unimodal model. The relative RR improvement for verbs is around 7%, whereas adverb recovery is 4% worse. These results show that visual context can recover words from a variety of categories, even though it is better at recovering entities, and struggles with words that are less groundable in the image.

Hierarchical Attention Analysis
In Table 1b, we also summarize the Grounding Rate of HierAttn-DF when recovering different types of words. We find that the most groundable words (nouns and places) have a Grounding Rate above 90%. This means that more than 90% of the time a noun or place was correctly recovered, the visual modality was being attended to. Adjectives and verbs, which are also groundable, have a Grounding Rate of approximately 76%. These trends confirm that the model's improvements in masked word recovery come from using the visual signal. In addition to calculating the Grounding Rate, we also check whether the model learns to "look" at the visual modality when it encounters a masked word. In Figure 3, we plot the average visual attention weight at the masked word timestep, as well as at the two preceding and two following timesteps. We see that the more groundable the word category, the more attention the model learns to pay to the visual modality when the word is masked.
In Table 2, we present some qualitative examples where we visualize how the attention to each modality evolves with time. We observe that the timesteps corresponding to masked words in the signal have significantly higher visual attention. We see that in the first example, all masked words are correctly recovered. In the second example,

Utility of RandWordMask Training
We compare our RandWordMask training scheme with the EntityMask training mechanism from Srinivasan et al. (2020). In EntityMask training, only entities (nouns) are masked during training, and we hypothesize that this makes the model better at recovering entities but unable to generalize to other word types. Since RandWordMask training involves masking words at random, we expect the model to generalize better to other word types. In Table 3

Silence vs Whitenoise Masking
The results in Tables 1a and 1b were obtained in the experimental setting where words are masked with silence. However, another masking strategy explored by Srinivasan et al. (2020) is white noise masking, where the masked word is replaced with white noise in the audio signal. Srinivasan et al. (2020) reported results in both masking scenarios, and noted that the improvements of the multimodal ASR model were similar in both. We further verify this by training unimodal and HierAttn-DF ASR models using RandWordMask, but with white noise masking instead of silence.
In Table 4, we report the Recovery Rates of both ASR models in both silence and white noise masking scenarios. We observe that while recovery is generally harder with white noise masking (evidenced by the lower RR of both unimodal and multimodal ASR models), the HierAttn-DF model shows approximately the same absolute improvement in RR over the unimodal ASR. This indicates that the multimodal model can be applied to the more difficult white noise masking as well.

Table 5: Recovery Rates (%) for the HierAttn-DF model when provided with correct (congruent) and misaligned (incongruent) images.

Congruency Analysis
We perform a sanity check of our model by misaligning audio utterances and images while decoding the trained model (Elliott, 2018). This evaluation quantifies the sensitivity of the model towards the visual modality. A model that is sensitive to the visual context would perform significantly worse when presented with an unrelated (incongruent) image during evaluation. Since the model has been trained to actively use the image, it is likely to extract incorrect information. In Table 5, we see that the HierAttn-DF model is substantially affected by the unrelated images (the recovery rate drops on average by 7%). This verifies that our multimodal models are sensitive to the image modality.

Conclusions
We show that visual signals improve multimodal speech recognition when the audio signal is subject to unstructured masking. RandWordMask simulates a wider range of noisy scenarios by masking different types of words in the audio signal during training and evaluation, as opposed to previous work that only masked groundable entities (Srinivasan et al., 2020). Future work involves developing new models that attend over visual features extracted from object proposals, which provide better visual signals.