Textual supervision for visually grounded spoken language understanding

Visually-grounded models of spoken language understanding extract semantic information directly from speech, without relying on transcriptions. This is useful for low-resource languages, where transcriptions can be expensive or impossible to obtain. Recent work showed that these models can be improved if transcriptions are available at training time. However, it is not clear how an end-to-end approach compares to a traditional pipeline-based approach when one has access to transcriptions. Comparing different strategies, we find that the pipeline approach works better when enough text is available. With low-resource languages in mind, we also show that translations can be effectively used in place of transcriptions but more data is needed to obtain similar results.


Introduction
Spoken language understanding promises to transform our interactions with technology by allowing people to control electronic devices through voice commands. However, mapping speech to meaning is far from trivial. The traditional approach, which has proven its effectiveness, relies on text as an intermediate representation. This so-called pipeline approach combines an automatic speech recognition (ASR) system and a natural language understanding (NLU) component. While this allows us to take advantage of improvements achieved in both fields, it requires transcribed speech, which is an expensive resource.
Visually-grounded models of spoken language understanding (Harwath et al., 2016; Chrupała et al., 2017) were recently introduced to extract semantic information from speech directly, without relying on textual information (see Figure 1 for an illustration). The advantages of these approaches are twofold: (i) expensive transcriptions are not necessary to train the system, which is beneficial for low-resource languages, and (ii) trained in an end-to-end fashion, the whole system is optimized for the final task, which has been shown to improve performance in other applications. While text is not necessary to train such systems, recent work has shown that they can greatly benefit from textual supervision if available (Chrupała, 2019; Pasad et al., 2019), generally using the multitask learning setup (MTL) (Caruana, 1997).

Figure 1: An image described by an English spoken caption (represented by its spectrogram), its transcription ("A boy in a green shirt on a skateboard on a stone wall with graffiti."), and its translation into Japanese (落書きされた石の壁でスケートボードに乗っている緑色のシャツを着た少年。). Visually-grounded models are usually trained to map the image and its spoken caption into a shared semantic space.
However, the end-to-end MTL-based models in previous works have not been compared against the more traditional pipeline approach that uses ASR as an intermediate step. The pipeline approach could be a strong baseline as, intuitively, written transcriptions are an accurate and concise representation of spoken language. In this paper, we set out to determine the relative differences in performance between end-to-end approaches and pipeline-based approaches. This study provides insights from a pragmatic point of view, as well as having important consequences for making further progress in understanding spoken language.
We also explore the question of the exact nature of the textual representations to be used in the visually-grounded spoken language scenario. The text used in previous work was transcriptions, which are a relatively faithful representation of the form of the spoken utterance. Other possibilities include, for example, subtitles, which tend to be less literal and abbreviated, or translations, which express the meaning of the utterance in another language. We focus on the case of translations due to the relevance of this condition for low-resource languages: some languages without a standardized writing system have many millions of speakers. One example is Hokkien, spoken in Taiwan and Southeast China: it may be more practical to collect translations of Hokkien into Mandarin than to get Hokkien speech transcribed. The question is whether translations would be effective as a source of textual supervision in visually-grounded learning.
In summary, our contributions are the following.
• We compare different strategies for leveraging textual supervision in the context of visually-grounded models of spoken language understanding: we compare a pipeline approach, where speech is first converted to text, with two end-to-end MTL systems. We find that the pipeline approach tends to be more effective when enough textual data is available.
• We analyze how the amount of transcribed data affects performance, showing that end-to-end training is competitive only when very little transcribed text is available; in that regime, however, textual supervision brings only marginal improvements.
• We explore the possibility of replacing transcriptions with written translations. In the case of translations, an end-to-end MTL approach outperforms the pipeline baselines; we also observe that more data is necessary with translations than with transcriptions, due to the more complex task faced by the system.
Related work

Visually-grounded models of spoken language understanding
Recent work has shown that semantic information can be extracted from speech in a weakly supervised manner when matched visual information is available. This approach is usually referred to as visually-grounded spoken language understanding. While original work focused on single words (Synnaeve et al., 2014; Harwath and Glass, 2015), the concept has quickly been extended to process full sentences (Harwath et al., 2016; Chrupała et al., 2017). Applied on datasets of images with spoken captions, this type of model is typically trained to perform a speech-image retrieval task where an utterance can be used to retrieve the image it describes (or vice versa). This is achieved through a triplet loss between images and utterances (in both directions).
A similar approach is used by Kamper and Roth (2018) to perform semantic keyword spotting. They include an image tagger in their model to provide tags for each image in order to retrieve sentences that match a keyword semantically, i.e. not only exact matches but also semantically related ones.

Textual supervision
A common thread in all of these studies is that the spoken sentences do not need to be transcribed, which is useful due to the cost attached to textual labeling. Subsequent work, however, showed that textual supervision, if available, can substantially improve performance. Chrupała (2019) uses transcriptions through multitask learning. He finds that adding a speech-text matching task, where spoken captions have to be matched with corresponding transcriptions, is particularly helpful. Pasad et al. (2019) applied the same idea to semantic keyword spotting with similar results. They also examine the effect of decreasing the size of the dataset. Hsu et al. (2019) explore the use of visually grounded models to improve ASR through transfer learning from the semantic matching task; in contrast, we are interested in improving the performance of the grounded model itself using textual supervision.

Multitask versus pipeline
Another area of related work is found in the spoken command understanding literature. Haghani et al. (2018) compare different architectures making use of textual supervision, covering both pipeline and end-to-end approaches. The models they explore include an ASR-based multitask system similar to the present work. For the pipeline system, they try both independent and joint training of the ASR and NLU components. Their conclusion is that an intermediate textual representation is important to achieve good performance and that jointly optimizing the different components improves predictions. Lugosch et al. (2019) propose a pretraining method for end-to-end spoken command recognition that relies on the availability of transcribed data. However, while this pretraining strategy brings improvement over a system trained without transcripts, the absence of any other text-based baseline (such as a pipeline system) prevents any conclusion on the advantage of end-to-end training when textual supervision is available.

Multilingual data
The idea of using multilingual data is not new in the literature: existing work focuses on using the same modality for the two languages, either text or speech. Gella et al. (2017) and Kádár et al. (2018) show that textual descriptions of images in different languages in the Multi30K dataset (Elliott et al., 2016) can be used in conjunction to improve the performance of a visually-grounded model. Harwath et al. (2018) focus on speech, exploring how spoken captions in two languages can be used simultaneously to improve performance on an English-Hindi parallel subset of the Places-205 dataset (Zhou et al., 2014). In contrast, our experiments concern the setting where speech data from a low-resource language is used in conjunction with corresponding translated written captions.
Directly mapping speech to textual translation, or spoken language translation (SLT), has received increasing interest lately. Following recent trends in ASR and machine translation, end-to-end approaches in particular have drawn much attention, showing competitive results against pipeline systems (e.g. Bérard et al., 2016; Weiss et al., 2017).

Methodology
The architecture and the training procedure used in this paper are inspired by the improved version of the visually-grounded spoken language understanding system of Merkx et al. (2019). Appendix A.1 provides details on the choice of hyperparameters.

Architectures
We will compare six different models, based on five architectures (summarized in Figure 2). Two models serve as references: the original speech-image matching system (Harwath et al., 2016; Chrupała et al., 2017), which does not rely on text, and a text-image model that works directly on text (and thus requires text even at test time; similar to the Text-RHN model of Chrupała et al. (2017) or the Char-GRU model of Merkx et al. (2019)).
We then have the four core models, which use text during training but can also work with speech only at test time; these are the models we are mainly interested in. They comprise two pipeline models, which only differ in their training procedure, and two multitask systems, using either a speech-text retrieval task (similar to what is done in Chrupała (2019) and Kamper and Roth (2018)) or ASR as the secondary target.

The speech-image baseline
The speech-image baseline (speech-image) is composed of two main components: an image encoder and a speech encoder (see Figure 2a).
Image encoder. The image encoder is composed of a single linear layer projecting the image features (see Section 3.4.2) into the shared semantic embedding space (dimension 2048), followed by a normalization layer (ℓ2 norm).
Speech encoder. The speech encoder is applied on the mel-frequency cepstral coefficient (MFCC) features described in Section 3.4.2 and composed of a 1D convolutional layer (kernel size 6, stride 2 and 64 output channels), followed by bidirectional gated recurrent units (GRUs) (Cho et al., 2014) (4 layers, hidden state of dimension 1024). A vectorial attention layer is then used to convert the variable-length input sequence to a fixed-size vector of dimension 2048. Finally, a normalization layer (ℓ2 norm) is applied.
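As a concrete illustration, the two encoders of the speech-image baseline can be sketched in PyTorch as follows. This is a simplified sketch based on the dimensions above; the exact attention formulation and all class/variable names are our assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorialAttention(nn.Module):
    """Collapses a variable-length sequence into one vector via a learned
    scalar score per timestep (our reading of 'vectorial attention')."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                      # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)
        return (w * x).sum(dim=1)              # (batch, dim)

class ImageEncoder(nn.Module):
    """Linear projection of image features into the shared space, then l2 norm."""
    def __init__(self, feat_dim=2048, emb_dim=2048):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, feats):                  # feats: (batch, feat_dim)
        return F.normalize(self.proj(feats), p=2, dim=1)

class SpeechEncoder(nn.Module):
    """Conv1d (kernel 6, stride 2, 64 channels) + 4-layer bidirectional GRU
    (hidden 1024, hence 2048 per timestep) + attention + l2 norm."""
    def __init__(self, n_feats=39):
        super().__init__()
        self.conv = nn.Conv1d(n_feats, 64, kernel_size=6, stride=2)
        self.gru = nn.GRU(64, 1024, num_layers=4,
                          bidirectional=True, batch_first=True)
        self.att = VectorialAttention(2048)

    def forward(self, mfcc):                   # mfcc: (batch, time, 39)
        x = self.conv(mfcc.transpose(1, 2)).transpose(1, 2)
        x, _ = self.gru(x)
        return F.normalize(self.att(x), p=2, dim=1)
```

Both encoders output unit-norm vectors of dimension 2048, so the cosine distance between an utterance and an image reduces to a dot product.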

The text-image baseline
The text-image baseline (text-image) measures the performance attainable if text is available at test time. It serves as a high estimate of what the four core models could achieve. Those models could theoretically perform better than the text-image baseline by taking advantage of information available in speech but not in text. However, extracting the equivalent of the textual representation from speech is not trivial, so we expect them to perform worse.
The text-image model comprises an image encoder and a text encoder (see Figure 2b). The image encoder is identical to that of the speech-image model.
Text encoder. The text encoder is character-based and maps the input characters to a 128-dimensional space through an embedding layer. The output is then fed to a bidirectional GRU (2 layers, hidden state of dimension 1024). A vectorial attention mechanism followed by a normalization layer (ℓ2 norm) summarizes the variable-length sequence into a fixed-length vector (dimension 2048).

The pipeline models
We trained two pipeline models (pipe-ind and pipe-seq) which only differ in their training procedure (see Section 3.2.2). The architecture (summarized in Figure 2c) is basically composed of an ASR module which maps speech to text, followed by the text-image system we just described (our NLU component). The same architecture is used when training with Japanese captions, though the first part is then referred to as the SLT module.
ASR/SLT module. The ASR/SLT module is an attention-based encoder-decoder system which can itself be decomposed into two sub-modules, an encoder and an attention-based decoder. The encoder is similar to the speech encoder described above: it is composed of the same convolutional layer followed by a bidirectional GRU (5 layers, hidden state of dimension 768) but lacks the attention and normalization layers. The attention-based decoder uses a timestep-dependent attention mechanism (Bahdanau et al., 2015) to summarize the encoded input sequence into fixed-size context vectors (one per output token). The recurrent decoder generates the output sequence one character at a time. At each time step, it takes the current context vector and the previous character as input. It is composed of a unidirectional GRU (1 layer, hidden state of dimension 768), a linear projection and a softmax activation layer.
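The timestep-dependent attention mechanism used by the decoder can be sketched as below. This is a minimal additive-attention sketch in the style of Bahdanau et al. (2015); the dimensions and names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: scores each encoder state against the current
    decoder state, then returns their weighted sum as the context vector."""
    def __init__(self, enc_dim, dec_dim, att_dim=128):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, att_dim)
        self.w_dec = nn.Linear(dec_dim, att_dim)
        self.v = nn.Linear(att_dim, 1)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, time, enc_dim), dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(self.w_enc(enc_states)
                                   + self.w_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)     # (batch, time, 1)
        return (weights * enc_states).sum(dim=1)   # context: (batch, enc_dim)
```

With the 5-layer bidirectional GRU encoder above (hidden state 768), enc_dim would be 1536, and a fresh context vector is computed for every output character.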

The ASR/SLT-based multitask model
The ASR/SLT-based multitask model (mtl-transcribe and mtl-translate respectively) combines the speech-image model with an ASR/SLT system (similar to the one used in the pipeline models). To do so, the speech encoder of the speech-image system and the encoder of the ASR/SLT system are merged into a single speech encoder composed of a shared network followed by two task-specific networks (see Figure 2d). The image encoder and the attention-based decoder being identical to the ones described previously, we will focus on the partially-shared speech encoder. The multitask training procedure is described in Section 3.2.3.
The mtl-transcribe/mtl-translate speech encoder. The shared part of the speech encoder is composed of a convolutional layer (same configuration as before) and a bidirectional GRU (4 layers, hidden state of dimension 768). The part dedicated to the secondary ASR/SLT task is only composed of an additional bidirectional GRU (1 layer, hidden state of dimension 768). The part dedicated to the speech-image retrieval task is composed of the same GRU layer but also incorporates the vectorial attention and normalization layers necessary to map the data into the audio-visual semantic space.

The text-as-input multitask model
The other multitask model (mtl-match) is based on Chrupała (2019). It combines the speech-image baseline with a speech-text retrieval task (see Figure 2e). Images and text are encoded by subnetworks identical to the image encoder described in Section 3.1.1 and the text encoder described in Section 3.1.2, respectively.
The mtl-match speech encoder. Similarly to the mtl-transcribe/mtl-translate architecture, the speech encoder is composed of a shared component and two dedicated parts. The shared encoder is again composed of a convolutional layer (same configuration as before), followed by a bidirectional GRU (2 layers, hidden state of dimension 1024). The part of the encoder dedicated to the speech-image task is composed of a GRU (2 layers, hidden state of dimension 1024), followed by the vectorial attention mechanism and the normalization layer (ℓ2 norm). Its counterpart for the speech-text task is only made of the vectorial attention mechanism and the ℓ2 normalization layer.

Losses
Retrieval loss. The main objective used to train our models is the triplet loss used by Harwath and Glass (2015). The goal is to map images and captions into a shared embedding space where matched images and captions are close to one another and mismatched elements are further apart. This is achieved through optimization of the following loss:

L = Σ_{(u,i)} Σ_{(u′,i′)} [ max(0, α + d(u, i) − d(u′, i)) + max(0, α + d(u, i) − d(u, i′)) ]

where (u, i) and (u′, i′) are each a pair of matching utterance and image from the current batch, d(·, ·) is the cosine distance between encoded utterance and image, and α is a margin (we use the value of 0.2).
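Assuming both encoders output ℓ2-normalised embeddings (so cosine similarity is a dot product), the bidirectional triplet loss over a batch can be sketched as:

```python
import torch

def triplet_loss(U, I, margin=0.2):
    """U: utterance embeddings, I: image embeddings, both (batch, dim) with
    unit l2 norm; row k of U is the caption of the image in row k of I.
    Sums the hinge over all mismatched in-batch pairs, in both directions."""
    D = 1.0 - U @ I.t()                 # cosine distances d(u, i)
    pos = D.diag().unsqueeze(1)         # d(u, i) for matched pairs, (batch, 1)
    cost_img = torch.clamp(margin + pos - D, min=0)       # wrong image i'
    cost_utt = torch.clamp(margin + pos.t() - D, min=0)   # wrong utterance u'
    off_diag = 1.0 - torch.eye(U.size(0))                 # mask matched pairs
    return ((cost_img + cost_utt) * off_diag).sum()
```

Some implementations average instead of summing, or sample a single contrastive example per pair; summing over all in-batch negatives is just one common realisation.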
Similarly, a network can be trained to match spoken captions with corresponding transcriptions/translations, replacing utterance and image pairs (u, i) with utterance and text pairs (u, t).
ASR/SLT loss. The ASR and SLT tasks are optimized through the usual cross-entropy loss between the decoded sequence (using greedy decoding) and the ground-truth text.

Training the pipeline systems
We use two strategies to train the pipeline systems:
• The independent training procedure (pipe-ind model), where each module (ASR/SLT and NLU) is trained independently from the other. Here the text encoder is trained on ground-truth written captions or translations.
• The sequential training procedure (pipe-seq model), where we first train the ASR/SLT module. Once done, we decode each spoken caption (with a beam search of width 10), and use the output to train the NLU system. Doing so reduces the mismatch between training and testing conditions, which can affect the performance of the NLU component. This second procedure is thus expected to perform better than the independent training strategy.

Multitask learning
The mtl-transcribe/mtl-translate and mtl-match strategies make use of multitask learning through shared weights.To train the models, we simply alternate between the two tasks, updating the parameters of each task in turn.
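A minimal sketch of this alternation is shown below; the step functions are hypothetical stand-ins for one optimiser update on each task, and per-batch alternation is only one simple realisation of "updating the parameters of each task in turn".

```python
import itertools

def train_alternating(step_retrieval, step_secondary,
                      retrieval_batches, secondary_batches):
    """One epoch of naive task alternation: each update on the main
    speech-image retrieval task is followed by one update on the secondary
    task (ASR/SLT or speech-text matching). The shorter loader is cycled."""
    for b_main, b_sec in zip(retrieval_batches,
                             itertools.cycle(secondary_batches)):
        step_retrieval(b_main)   # updates shared + retrieval-specific weights
        step_secondary(b_sec)    # updates shared + secondary-specific weights
```

How gradients are weighted across tasks, or whether alternation happens per batch or per epoch, are scheduling choices the description above does not fix.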

Optimization procedure
The optimization procedure follows Merkx et al. (2019). We use the Adam optimizer (Kingma and Ba, 2015) with a cyclic learning rate (Smith, 2017) varying between 10⁻⁶ and 2 × 10⁻⁴. All networks are trained for 32 epochs, and unlike Merkx et al., we do not use ensembling.
Evaluation

Retrieval performance is measured with recall at n (R@n) and median rank. To compute these metrics, images and utterances are compared based on the cosine distance between their embeddings, resulting in a ranked list of images for each utterance, in order of increasing distance. One can then compute the proportion of utterances for which the paired image appears in the top n images (R@n), or the median rank of the paired image over all utterances. For brevity, we only report results with R@10 in the body of the paper. The complete set of results is available in Appendix A.2.
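For reference, R@n and the median rank can be computed from the two sets of embeddings as follows (a sketch assuming unit-norm embeddings, with matched pairs sharing a row index):

```python
import numpy as np

def retrieval_metrics(utt_emb, img_emb, n=10):
    """utt_emb, img_emb: (N, d) arrays of l2-normalised embeddings; row k of
    utt_emb is paired with row k of img_emb.
    Returns (R@n, median rank) for utterance-to-image retrieval."""
    dist = 1.0 - utt_emb @ img_emb.T              # cosine distances
    order = np.argsort(dist, axis=1)              # images by increasing distance
    # 1-based rank of the paired image for each utterance
    ranks = np.argmax(order == np.arange(len(utt_emb))[:, None], axis=1) + 1
    return float(np.mean(ranks <= n)), float(np.median(ranks))
```

Image-to-utterance retrieval is obtained by transposing the roles of the two matrices.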
For ASR and SLT, we report word error rate (WER) and BLEU score respectively, using beam decoding in both cases (with a beam width of 10).
All results we report are the mean over three runs of the same experiment with different random seeds.

Datasets
The visually grounded models presented in this paper require pairs of images and spoken captions for training. For our experiments on textual supervision, we additionally need the transcriptions corresponding to those spoken captions, or alternatively a translated version of these transcripts. We obtain these elements from a set of related datasets:
• Flickr8K (Hodosh et al., 2013) offers 8,000 images of everyday situations gathered from the website flickr.com, together with English written captions (5 per image) that were obtained through crowdsourcing.
• The Flickr Audio Caption Corpus (Harwath and Glass, 2015) augments Flickr8K with spoken captions read aloud by crowd workers.
• F30kEnt-JP (Nakayama et al., 2020) provides Japanese translations for a subset of the English captions (two per image).1

In all experiments, we use English as the source language for our models. While English is not a low-resource language, it is the only one for which we have spoken captions. The low-resource setting with translations is thus a simulated setting.
To summarize, we have 8,000 images with 40,000 captions (five per image), in both English written and spoken form (amounting to ∼34 hours of speech).In addition, we have Japanese translations for two captions per image.
Validation and test sets are composed of 1,000 images from the original set each (with corresponding captions), using the split introduced in Karpathy and Fei-Fei (2015).The training set is composed of the 6,000 remaining images.
We additionally introduce a smaller version of the dataset available for experiments with English transcriptions (later referred to as the reduced English dataset), matching in size the one used for experiments with Japanese translations (i.e. keeping only the sentences that have a translation, even though we use transcriptions).
1 Items from F30kEnt-JP and Flickr8K were matched based on exact matches between the English written captions in both datasets. We also corrected for missing hyphens (e.g. "red haired" and "red-haired" are considered the same), leaving us with 15,498 captions with a Japanese translation.

Pre-processing
Image features are extracted from the pre-classification layer of a frozen ResNet-152 model (He et al., 2016) pretrained on ImageNet (Deng et al., 2009). We follow Merkx et al. (2019) and use features that are the result of taking the mean feature vector over ten crops of each image.
The acoustic feature vectors are composed of 12 MFCCs and log energy, with first and second derivatives, resulting in 39-dimensional vectors.They are computed over windows of 25 ms of speech, with 10 ms shift.
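Assuming the 13 static features per frame (12 MFCCs plus log energy) are already computed, the derivative stacking can be sketched as below. This is a simplified regression-based delta; note that real front ends pad edge frames rather than wrapping around, as np.roll does here.

```python
import numpy as np

def add_deltas(static, width=2):
    """Append first and second time derivatives to per-frame features:
    (T, 13) static features -> (T, 39) acoustic feature vectors."""
    denom = 2 * sum(k * k for k in range(1, width + 1))

    def delta(x):
        # standard regression formula over +/- `width` neighbouring frames
        return sum(k * (np.roll(x, -k, axis=0) - np.roll(x, k, axis=0))
                   for k in range(1, width + 1)) / denom

    d1 = delta(static)       # first derivatives
    d2 = delta(d1)           # second derivatives
    return np.concatenate([static, d1, d2], axis=1)
```

The 25 ms / 10 ms framing itself is done by the MFCC front end and is not shown here.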

Repository
The code necessary to replicate our experiments is available under Apache License 2.0 at github.com/bhigy/textual-supervision.

Impact of the architecture
We first look at the performance of the different models trained with the full Flickr8K training set and English transcriptions.
As expected, using text directly as input, instead of speech, makes the task much easier. This is exemplified by the difference between the speech-image and text-image models in Table 1.
Table 2 reports the performance of the four core models. We notice that both pipeline and multitask architectures can use the textual supervision to improve results over the text-less baseline, though the pipeline approaches clearly have the advantage, with an R@10 of 0.642. Contrary to what one might expect, training the pipeline system sequentially does not bring any improvement over independent training of the modules (at least with this amount of data).
Comparing the two multitask approaches, we see that using ASR as a secondary task (mtl-transcribe) is much more effective than using text as another input modality (mtl-match).
Table 3 reports the performance of the best pipeline and the best multitask model on the test set. For completeness, Appendix A.2 reports the same results as Tables 1, 2 and 3 with different metrics (R@1, R@5 and Medr). Appendix A.3 reports the performance on the ASR task.

Using translations
Tables 1 and 2 report the performance of the same models when trained with Japanese translations. The scores are overall lower than the results with English transcriptions, which can be explained by two factors: (i) the size of the dataset, which is only ∼2/5 of the original Flickr8K (as evidenced by the lower score of the text-less speech-image baseline), and (ii) the added difficulty introduced by translation over transcription. Indeed, to translate speech, one first needs to recognize what is being said and then translate it to the other language: thus translation involves many complex phenomena (e.g. reordering) which are absent from the transcription task.
While the four strategies presented in Table 2 improve over the speech-image baseline, their relative order differs from what is reported with English text. This time, the mtl-translate approach is the one giving the best score, with an R@10 of 0.394, outperforming the pipeline systems (both of which perform similarly well in this context).
The difference in relative order of the models is likely the result of the degraded conditions (less data and harder task) impacting the translation task more severely than the speech-image retrieval task.The pipeline approaches, which rely directly on the output of the SLT component, are affected more strongly than the mtl-translate system where SLT is only a secondary target.This is in line with the results reported in the next section on downsampling the amount of textual data.
Table 3 reports the performance of the best pipeline and the best multitask model on the test set. For completeness, Appendix A.2 reports the same results as Tables 1, 2 and 3 with different metrics (R@1, R@5 and Medr). Appendix A.3 reports the performance on the SLT task.

Disentangling dataset size and task factor
In an attempt to disentangle the effects of the smaller dataset and the harder task, we also report results on the reduced English dataset described in Section 3.4.1 (Tables 1 and 2, third column). Looking first at Table 2, we can see that both factors do indeed play a role in the drop in performance, though not to the same extent. Taking the pipe-seq model as an example, reducing the size of the dataset results in a 7% drop in R@10, while switching to translations further reduces accuracy by 42%.
An unexpected result comes from the text-image system (Table 1). Even though the model works on ground-truth text (no translation involved), we still see a 4% drop in R@10 between the reduced English condition and Japanese. This suggests that the models trained with Japanese translations are not only penalized by the translation task being harder, but also that extracting meaning from Japanese text is more challenging than from English (possibly due to a more complicated writing system).

Downsampling experiments
We now report on experiments that downsample the amount of textual supervision available while keeping the amount of speech and images fixed. We can see in Figure 3 (left) that, as the amount of transcribed data decreases, the scores of the text-image and pipeline models progressively drop toward 0% R@10. Between 11.3 and 3.8 hours of transcribed data, the text-image and pipe-ind models fall below the performance of the speech-image baseline. The pipe-seq model is more robust and its performance stays higher (actually the best) until the amount of data goes below 3.8 hours of transcribed speech, after which its R@10 falls abruptly. This is likely the effect of the sequential training procedure allowing pipe-seq to use all the available speech to train the text-image module (by transcribing it with the ASR module). Below 3.8 hours of speech, the quality of the transcriptions given by the ASR system deteriorates to the point that it is no longer usable by the downstream component.
The two multitask approaches, on the other hand, progressively converge toward the speech-only baseline. The mtl-transcribe approach gives overall better results than the mtl-match approach, but fails to provide a significant advantage over the other systems. Only after the performance of the pipe-seq system abruptly decreases (from 1.3 hours of transcribed speech and below) does the mtl-transcribe system surpass it, at which point it is already performing very close to the speech-image baseline.
Figure 3 (right) reports on the same set of experiments with Japanese translations. In this case too, the text-image, pipe-ind and pipe-seq models go toward 0% R@10 as the amount of translated data decreases, while the mtl-translate and mtl-match systems converge toward the speech-image baseline. It seems though that 4.5 hours of translated data is not enough to see an improvement over the speech-image baseline with any of the models.
In this case, the pipe-seq model does not have a significant advantage over the pipe-ind model, likely due to the difficulty of the translation task. The same reason probably explains why the mtl-translate strategy performs best on the full dataset (as reported in Section 4.2). However, while the pipeline architectures never surpass the mtl-translate model in the experiments reported here, this may change with more data.
For completeness, Appendix A.3 reports the performance on the ASR and SLT tasks themselves, for decreasing amounts of textual data.

Conclusion
In this paper, we investigated the use of textual supervision in visually-grounded models of spoken language understanding. We found that the improvements reported in Chrupała (2019) and Pasad et al. (2019) are a low estimate of what can be achieved when textual labels are available. Among the different approaches we explored, the more traditional pipeline approach, trained sequentially, is particularly effective and hard to beat with end-to-end systems. This indicates that text is a very powerful intermediate representation. End-to-end approaches tend to perform better only when the amount of textual data is limited.
We have also shown that written translations are a viable alternative to transcriptions (especially for unwritten languages), though more data might be useful to compensate for the harder task.

Limitations and future work
We ran our experiments on the Flickr8K dataset, which is a read-speech dataset. We are thus likely underestimating the advantages of end-to-end approaches over pipeline approaches, in that end-to-end systems can use information present in speech (such as prosody) but not in text. Running experiments on a dataset with more natural and conversational speech could show end-to-end systems in a better light.
On the other hand, we restricted ourselves to training the ASR and NLU components of the pipeline systems independently. Recent techniques such as Gumbel-Softmax (Jang et al., 2017) or the straight-through estimator (Bengio et al., 2013) could be applied to train or finetune this model in an end-to-end fashion while still enforcing a symbolic intermediate representation similar to text.
In the same vein, it would be interesting to explore more generally whether and how an inductive bias could be incorporated in the architecture to encourage the model to discover such a symbolic representation naturally.

Figure 2 :
Figure 2: Architecture of the different models.

Table 1 :
Validation set R@10 of our two reference models when trained either with English transcriptions, a reduced version of the English dataset, or Japanese translations.

Table 2 :
Validation set R@10 of our four core models, when trained either with English transcriptions, a reduced version of the English dataset, or Japanese translations.

Table 3 :
Test set R@10 of the best pipeline (pipe-seq) and the best multitask (mtl-transcribe/mtl-translate) models, when trained with all English transcriptions or all Japanese translations.
Figure 3: Results (R@10) of the different models on the validation set, when trained with decreasing amounts of English transcriptions (left) or Japanese translations (right). The total amount of speech available is kept identical; only the amount of textual data changes.