Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech

The language acquisition literature shows that children do not build their lexicon by segmenting the spoken input into phonemes and then building up words from them, but rather adopt a top-down approach: they start by segmenting word-like units and then break them down into smaller units. This suggests that the ideal way of learning a language is by starting from full semantic units. In this paper, we investigate whether this is also the case for a neural model of Visually Grounded Speech trained on a speech-image retrieval task. We evaluate how well such a network is able to learn a reliable speech-to-image mapping when provided with phone, syllable, or word boundary information. We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most efficient. We also explore at which level of the network's architecture such information should be introduced so as to maximise its performance. Finally, we show that using multiple boundary types at once in a hierarchical structure, by which low-level segments are used to recompose high-level segments, is beneficial and yields better results than using low-level or high-level segments in isolation.


Introduction and Prior Work
Visually Grounded Speech (VGS) models, whether CNN-based (Harwath and Glass, 2015; Harwath et al., 2016; Kamper et al., 2017) or RNN-based (Chrupała et al., 2017; Merkx et al., 2019), have recently become popular as they model complex interactions between two modalities, namely speech and vision, and can thus be used to model child language acquisition, and more specifically lexical acquisition. Indeed, these models are trained to solve a speech-image retrieval task, which requires identifying lexical units that might be relevant in the spoken input, detecting which objects are present in the image, and finally checking whether those objects match the detected spoken lexical units. Their task is thus very close to that of a child learning its mother tongue, who is surrounded by a visually perceptible context and tries to match parts of the acoustic input to surrounding visible situations. Research in language acquisition has put forward that children do not build their lexicon by segmenting the spoken input into phonemes and then building up words, but rather adopt a top-down approach (Bortfeld et al., 2005) and start by identifying and memorising whole words (Jusczyk and Aslin, 1995) or chunks of words (Bannard and Matthews, 2008) before segmenting the spoken input into smaller units, such as phonemes. This suggests that the most efficient way of segmenting the spoken input to map a visual context to its description is at the word level. From a more technological point of view, speech-based models lag behind their textual counterparts. For example, speech-image retrieval performs worse than text-image retrieval, despite being trained on the same data, the only changing factor being the modality in which the query is expressed (text or speech). This begs the question: what makes text inherently better than speech for such applications? Is it because text is made up of already-segmented (discrete) units which lack internal variation, or because these discrete units (usually tokens) stand for full
semantic units, or a combination of both?
Since the pioneering computational modelling work on lexical acquisition by Roy and Pentland (2002), neural networks have enabled an even tighter interaction between the visual and the audio modality. Recent work suggests that networks trained on a speech-image retrieval task perform an implicit segmentation of their input. Whether CNN-based or RNN-based approaches are employed, all seem to segment individual words from the input spoken utterance (Harwath et al., 2016; Chrupała et al., 2017; Havard et al., 2019; Merkx et al., 2019). This result also holds for languages other than English, such as Hindi or Japanese (Harwath et al., 2018; Havard et al., 2019; Azuh et al., 2019; Ohishi et al., 2020). Chrupała et al. (2017) and Merkx et al. (2019) found, however, that not all layers encode word-like units, suggesting that some layers specialise in lexical processing whereas others do not encode such information.
Contributions Our research question can be framed as follows: if speech were to be segmented, what segmentation maximises the performance of an audio-visual network? To answer this question, we investigate how speech boundary information can be given to a neural network and explore which type of boundary (phone, syllable, or word) is the most efficient. We also explore where such information should be provided, that is, at which layer of the architecture the addition of this information is the most beneficial.
This paper is structured as follows: section 2 details our experimental material (data and model). Our contributions follow in sections 3 to 7. Finally, we conclude with a discussion in section 8 and suggest future lines of work in section 9.

Model & Data

Data
We use two different data sets in our experiments: MS COCO (Lin et al., 2014) and Flickr8k (Hodosh et al., 2013). Both corpora were initially conceived for computer vision purposes and both feature a set of images along with five written descriptions of their content. The captions were not computer-generated but written by humans. We use the audio extensions of both data sets: for Flickr8k, we use the captions provided by Harwath and Glass (2015), and for MS COCO we use the Synthetic COCO data set introduced by Chrupała et al. (2017). The captions of Harwath and Glass (2015) were gathered using Amazon Mechanical Turk and were thus uttered by humans. This data set is particularly challenging as it features multiple speakers and the recording quality is uneven from one caption to another. The spoken captions of Chrupała et al. (2017) feature synthetic speech generated with Google's Text-to-Speech system.
Even though COCO uses a synthetic voice, this data set has several advantages over Flickr8k: the number of image/caption pairs is much bigger (600k vs. 40k) and COCO features only one synthetic voice.
For both corpora, we extracted speech-to-text alignments with the Maus forced aligner1 (Kisler et al., 2017) online platform, resulting in alignments at the word and phone level.

Model Architecture and Training
Architecture The models we train in the experiments described in this article all share the same architecture, based on that of Chrupała et al. (2017).2 Like all VGS models, be they CNN-based or RNN-based, this architecture has two main components: an image encoder and a speech encoder. Such models are trained to solve a speech-image retrieval task: given a query in the form of a spoken description, they should retrieve the closest matching image fitting the description.
The image encoder is a simple linear layer that reduces the pre-computed VGG image vectors to the desired dimension. The speech encoder, which receives MFCC vectors as input, consists of a 1D convolutional layer, followed by five stacked recurrent layers with residual connections, followed by an attention mechanism. We use uni-directional rather than bi-directional recurrent layers, even though the latter have been shown (Merkx et al., 2019) to lead to better results. Indeed, we aim at a cognitively plausible model: humans process speech in a left-to-right fashion, as speech is being gradually uttered, and not from both ends simultaneously (which would in fact be impossible).
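The attention mechanism that pools the recurrent states into a single utterance embedding can be sketched as follows. This is a minimal sketch of scalar self-attention pooling; the exact parametrisation used in the original model may differ, and the names `attention_pool` and `w` are illustrative.

```python
import numpy as np

def attention_pool(H, w):
    """Pool a sequence of RNN states H (T, d) into a single utterance
    vector: score each timestep with a learned vector w (d,), softmax
    the scores, and return the weighted sum of the states."""
    scores = H @ w                           # (T,)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ H                       # (d,)
```

With a zero scoring vector the weights are uniform and the pooling reduces to a plain average of the states; a trained `w` instead lets the encoder emphasise the timesteps most useful for retrieval.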
We use the same loss function as Chrupała et al. (2017):

L(u, i) = ∑_{u′} max(0, α + d(u, i) − d(u′, i)) + ∑_{i′} max(0, α + d(u, i) − d(u, i′))

This contrastive loss function encourages the network to make the cosine distance d between an image i and its corresponding utterance u smaller, by a margin α, than the distance between mismatching image/utterance pairs i′/u and i/u′. In our experiments we set α = 0.2.

Hyperparameters For both COCO and Flickr8k models we use 1D convolutions with 64 filters of length 6 and a stride of 1 to preserve the original time resolution (and hence, boundary positions).3 We use 512 units per recurrent layer for COCO and 1024 for Flickr8k. All models were trained using an Adam optimiser and a learning rate of 0.0002.
For our experiments we use the pre-computed MFCC vectors and pre-computed VGG vectors provided by Chrupała et al. (2017).4 We also use the same training, validation and testing splits.5

Integrating Segmentation Information

Boundary types
As previously stated, we are interested in supplying our network with linguistic information in the form of segment boundaries. We define a segment as either a phoneme, a syllable, or a word. We consider two different types of syllables. Indeed, when we speak, words are not uttered one after the other in a disconnected fashion, but are rather blended together through a process called "resyllabification". In English, this phenomenon is visible when a word ending with a consonant is followed by a word starting with a vowel. In this case, the final consonant of the first word tends to be detached from it and attached to the next word, thus crossing the word boundary. This phenomenon is illustrated in Example (1), where the phonemes in red indicate resyllabification.
(1) This is an article. Transcription6 /ðɪs#ɪz#ən#ɑɹtɪkəl/
a. No resyllabification /ðɪs.ɪz.ən.ɑɹ.tɪ.kəl/
b. With resyllabification /ðɪ.sɪ.zə.nɑɹ.tɪ.kəl/
For the rest of this article, "syllables-word" will refer to syllables resulting from a segmentation that does not take resyllabification into account (1-a), whereas "syllables-connected" will refer to syllables resulting from a segmentation that does (1-b). It should be noted that in the syllable-connected condition, most word boundaries are lost.7 In the syllable-word condition, however, all word boundaries are preserved and the segmentation inside a word may occasionally result in a morphemic segmentation (as for example in "runway" /ɹʌn.weɪ/ or "airplane" /ɛɹ.pleɪn/). However, this is not always the case, especially for longer words of non-Germanic origin (such as "elephant" /ɛ.lɛ.fənt/ or "computer" /kəm.pju.tɚ/). We expect models trained in the syllables-connected condition to perform worse than those trained in the syllables-word condition, as resyllabification has been found to hinder word recognition (Vroomen and Gelder, 1999).
Segment boundaries were derived from the forced alignment metadata (see §2.1) so as to indicate whether each MFCC vector constitutes a boundary or not.8 Thus, for each caption we have a binary vector B = (b_1, ..., b_T), where b_t = 1 if the MFCC vector at timestep t is the last vector of a segment and b_t = 0 otherwise.
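Such a boundary vector can be derived from the alignment's segment end times. The sketch below assumes a 10 ms MFCC frame shift (the actual feature configuration may differ) and an illustrative function name.

```python
import numpy as np

def boundary_vector(segment_ends_s, n_frames, frame_shift_s=0.010):
    """Build B = (b_1, ..., b_T): b_t = 1 if MFCC frame t is the last
    frame of an aligned segment, 0 otherwise."""
    B = np.zeros(n_frames, dtype=int)
    for end in segment_ends_s:
        t = int(round(end / frame_shift_s)) - 1   # last frame inside the segment
        B[min(max(t, 0), n_frames - 1)] = 1
    return B
```

For example, segments ending at 0.05 s and 0.12 s mark frames 4 and 11 of a 15-frame caption as boundaries.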

Integrating Boundary Information
In order to integrate boundary information into the network, we take advantage of how recurrent neural networks compute their output. Recurrent neural networks are particularly suitable when the input of the network is a sequence, as in our case. They can be formalised as follows:

h_t = f(h_{t−1}, x_t; θ)

where the hidden state at timestep t, noted h_t, is a function f of the previous hidden state h_{t−1} and the current input x_t, with θ being the learnable parameters of f. A special case arises at the very first timestep t = 1, as h_{t−1} does not exist. In this case, the initial state, noted h_0, is set to be a vector of zeros. The output of such a network at timestep T is thus dependent on all the previous timesteps. An illustration of such a network is depicted in Figure 1a. In this work, we use GRUs (Cho et al., 2014), but our methodology is applicable to any other type of recurrent cell, such as vanilla RNNs or LSTMs. Figure 1a shows a vanilla GRU.
Figure 1b shows GRU PACK. in the ALL condition, where all the vectors produced at each timestep are passed on to the next layer. Figure 1c shows GRU PACK. in the KEEP condition, where only the last vector of a segment is passed on to the next layer, thus resulting in an output sequence shorter than the input sequence. The red crosses inscribed in a square signal that the output vector computed at a given timestep is not passed on to the next timestep and that the initial state h_0 is passed on instead. The red crosses inscribed in a circle signal that the output vector computed at a given timestep is not passed on to the next layer. Dotted lines group vectors belonging to the same segment (either phone, syllable-connected, syllable-word, or word). Note that h_0 is only passed on to the next state at the end of a segment, thus effectively materialising a boundary by manually resetting the history. Also note that the x_1, x_2, ..., x_t in this representation could either be the original input sequence (in our case, acoustic vectors) or the output of the previous recurrent layer.
Our approach to integrating boundary information into the network can be formalised as follows:

h_t = f(h_{t−1}, x_t; θ) if b_{t−1} = 0
h_t = f(h_0, x_t; θ) if b_{t−1} = 1

In our approach, h_t only depends on the previous timestep h_{t−1} if the previous timestep did not correspond to a segment boundary (b_{t−1} = 0). If the previous timestep corresponds to a segment boundary (b_{t−1} = 1), we reset the hidden state so that it is equal to h_0. Hence, vectors in the same segment are temporally dependent, but vectors belonging to two different segments are not. The GRUs that use this computing scheme will from now on be referred to as GRU PACK., as vectors belonging to the same segment are "packed" together. We derive two different conditions from this initial setting: ALL and KEEP. In the ALL condition (see Figure 1b), all the vectors belonging to a segment are forwarded to the next layer (which can either be a recurrent layer or an attention mechanism, depending on the position of the GRU PACK. layer, see §2.2). In the KEEP condition, only the last vector of each segment is forwarded to the next layer (see Figure 1c). The lengths of the output and input sequences stay the same in the ALL condition. However, in the KEEP condition, the output sequence is shorter than the input sequence. The sequence lengths can potentially differ across items inside a batch, as the captions have different numbers of segments (be they phones, syllables or words). For this reason, and as the subsequent layers expect a 3D rectangular matrix,9 we add padding vectors along the sequence dimension until all the elements of the batch have the same sequence length. The difference between ALL and KEEP is motivated by our belief that keeping only the last vector of a segment could constrain the network to build more consistent representations for different occurrences of the same segment, as the subsequent layers will have less information to rely on. A similar approach to ours was proposed by
Chen et al. (2019) in an Audio-Word2Vec experiment, where instead of being given gold segment boundaries, a classifier outputs the probability that a given frame constitutes a segment boundary.
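The reset-and-forward scheme can be sketched independently of the cell type. Below, `step` stands in for any recurrence (the real model uses GRUs), and the function names are ours, not the authors'.

```python
import numpy as np

def gru_pack(x, B, step, h0, keep_last=False):
    """Run `step(h, x_t)` over x (T, d), resetting the hidden state to
    h0 right after each segment boundary (B[t] == 1). With
    keep_last=False (ALL condition) every output is returned; with
    keep_last=True (KEEP condition) only segment-final outputs are."""
    outputs, h = [], h0
    for t in range(len(x)):
        h = step(h, x[t])
        outputs.append((h, B[t] == 1))
        if B[t] == 1:          # boundary: erase the history
            h = h0
    if keep_last:
        return np.stack([o for o, is_end in outputs if is_end])
    return np.stack([o for o, _ in outputs])
```

With a toy accumulator step `lambda h, xt: h + xt`, input `[1, 2, 3, 4]` and boundaries after the 2nd and 4th frames, the ALL condition yields `[1, 3, 3, 7]` and the KEEP condition `[3, 7]`: the running sum restarts at each boundary, so vectors in different segments are not temporally dependent.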

GRU PACK. Position
In order to understand where boundary information should be introduced (that is, at which level of the architecture), we train as many models as there are recurrent layers, where each time one layer of GRUs is replaced with one GRU PACK. layer. For example, "GRU PACK.-3" refers to a model where the third layer of GRUs is a GRU PACK. layer and the other layers (1st, 2nd, 4th, and 5th) are vanilla GRU layers. This setting allows us to explore where introducing boundary information is the most efficient.

Random Boundaries
In order to understand whether introducing boundary information helps the network in its task, we compare the performance of the models using boundary information with a baseline model that does not use any (thus, all the recurrent layers of the baseline architecture are vanilla GRU layers). This model will from now on be referred to as BASELINE. We also introduce another condition where, instead of training models with real segment boundaries (referred to as TRUE), we train models with random boundaries (referred to as RANDOM). Indeed, it could be that randomly slicing speech into sub-units leads to better results, even though the resulting units are not linguistically meaningful. Training models with random boundaries enables us to verify this claim. Random boundaries were generated by simply shuffling the positions of the real boundaries (vector B introduced in §3.1), resulting in as many randomly positioned boundaries as there are real boundaries. Note that we still expect the models to reach reasonable results even when using random boundaries, as the acoustic vectors are kept untouched. However, we expect that placing random boundaries will hinder the network's learning process and thus yield results significantly lower than when using true boundaries. We expect the results to be significantly lower in the RANDOM-KEEP condition, as this condition is equivalent to randomly subsampling the input, and thus removes a lot of information.
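Generating the RANDOM condition amounts to a single shuffle of B, which keeps the number of boundaries constant. A minimal sketch with illustrative names:

```python
import numpy as np

def random_boundaries(B, seed=0):
    """Shuffle the positions of the 1s in B: as many boundaries as in
    the TRUE condition, but at random, linguistically meaningless
    positions."""
    rng = np.random.default_rng(seed)
    B_rand = np.array(B)   # copy, leave the true vector untouched
    rng.shuffle(B_rand)
    return B_rand
```

The shuffle preserves the boundary count (and hence the number of segments), so any performance gap between TRUE and RANDOM is attributable to boundary placement, not to the amount of segmentation.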

Evaluation
Models are evaluated in terms of Recall@k (R@k). Given a spoken query, R@k evaluates the model's ability to rank the target paired image in the top k images. In order to evaluate whether the results observed in our different experimental conditions (TRUE-ALL, TRUE-KEEP, RANDOM-ALL, RANDOM-KEEP) differ from one another and from the BASELINE condition, we use a two-sided two-proportion Z-test, which tests whether there is a statistical difference between two independent proportions. As for each spoken query there is only one target image, R@k becomes a binary value which equals 1 if the target image is ranked in the top k images and 0 otherwise. In our case, the proportion that we test is the number of successes over the number of trials (which corresponds to the number of caption/image pairs in the test set).
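Both the metric and the test reduce to a few lines. A minimal sketch (function names ours), using the standard normal-approximation two-proportion z-test with pooled variance:

```python
import numpy as np
from math import sqrt, erf

def recall_at_k(ranks, k):
    """ranks[j] = 1-based rank of the target image for query j."""
    return float(np.mean(np.asarray(ranks) <= k))

def two_proportion_z_test(successes_a, successes_b, n_trials):
    """Two-sided z-test for two independent proportions measured over
    the same number of trials (here, caption/image pairs)."""
    pa, pb = successes_a / n_trials, successes_b / n_trials
    pooled = (successes_a + successes_b) / (2 * n_trials)
    z = (pa - pb) / sqrt(pooled * (1 - pooled) * (2 / n_trials))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * (1 - Phi(|z|))
    return z, p_value
```

For instance, 60 vs. 40 successes over 100 trials gives z ≈ 2.83 and a two-sided p-value below 0.01, so the proportions would be judged significantly different at the threshold used in this paper.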

Results
Overall, our experimental settings led to the training of 81 different models per data set.10 Results for COCO and Flickr8k are shown in Tables 2 and 3 respectively. We obtain lower results on Flickr8k than on COCO, which shows how difficult the task is on natural speech. Note that the results obtained on synthetic speech are also very low compared to their textual counterparts.11

TRUE and RANDOM Boundaries
The first question our experiments aim at answering is whether introducing boundary information helps the network in solving its task. Overall, models trained with boundary information (either TRUE or RANDOM) have a better R@1 than their baseline counterparts. Surprisingly, introducing random boundaries may sometimes lead to statistically better results than the baseline, even though the resulting segments are linguistically meaningless units. We explain this by the fact that adding random boundaries, and thus randomly resetting the RNN's memory, adds noise and acts as a form of regularisation for the network. However, using TRUE boundaries, regardless of their type, yields overall better R@1 than RANDOM boundaries and BASELINE, indicating that the models indeed effectively used the provided information.

Table 2: Maximum R@1 (in %) for each model trained on the COCO data set. "T" stands for TRUE (boundaries) and "R" stands for RANDOM (boundaries). "Syl-Co." and "Syl-Word" stand for "Syllable-Connected" and "Syllable-Word" respectively. Each line shows the results for when a specific recurrent layer is a GRU PACK. layer (see §4.1). The 1st layer is the lowest (right after the 1D convolutions and acoustic vectors) and the 5th the highest (right after the four preceding recurrent layers and before the attention mechanism). Highest R@1 in the

ALL and KEEP
Our experiments show that even though introducing boundary information and keeping all the vectors of a segment (Figure 1b) does lead to better R@1 than the baseline, keeping only the last vector of a segment (Figure 1c) yields overall much better results. However, we observe two different patterns in the RANDOM-KEEP condition depending on the data set. For Flickr8k, keeping the last vector in the RANDOM condition worsens R@1, whereas for COCO we observe stable results or even slight improvements. This is explained by the fact that Flickr8k features real speech with multiple speakers and high inter- and intra-speaker variation. Thus, using random boundaries which do not delimit meaningful linguistic units really hurts the performance of the network. COCO, however, uses synthetic speech with only one voice and hence has very low intra-speaker variation. Thus, even though we randomly subsample the input, as there is very little intra-speaker variation, the network is much more likely to figure out from which units the subsampled vectors came. In the ALL condition, the results for TRUE and RANDOM are only significantly different from one another and from the BASELINE, for both COCO and Flickr8k, at the lower layers, showing that the network effectively uses this information only in the lower layers. Notice that no result is significant for Flickr8k in the upper layers and there is no significant difference between TRUE and RANDOM boundaries, showing that boundary information in the ALL condition is not used effectively in the upper layers. The lack of statistical significance between RANDOM and TRUE in the ALL condition for COCO shows that a regularisation effect alone explains the significant difference with the baseline. Indeed, if it were the linguistic nature of the segmentation that explained such a difference, we should observe a significant difference between TRUE boundaries and the BASELINE, which is not the case.

Phones, Syllables, or Words
In our experiments we used four different types of segments corresponding to different types of linguistic units: phones, syllables-connected, syllables-word, and words. These types of segments vary in length (words and syllables are longer than phones), quantity (there are more phones and syllables than words), and intrinsic linguistic information: phones only reflect the basic acoustic units of the language, word segments represent meaningful units, and syllable-word and syllable-connected segments are higher-level acoustic units that may contain morphemic information. Given the task the network is trained for (speech-image retrieval), we do not expect these different units to perform equally well. Indeed, as this task implies mapping an image vector describing which objects are present in a picture to a spoken description of the image, we expect word-like segments (or segments that preserve word boundaries while bearing a substantial amount of semantic information) to perform better. This is indeed what we observe in practice. While phone-like units bring a slight improvement over the baseline results, syllable-word units (Flickr8k) and word units (COCO) obtain the highest results. It should be noted that syllable-connected segments also obtain statistically significant improvements over the baseline (GRU PACK.-2,3), despite not preserving all the word boundaries. However, these results are slightly worse than those of the syllable-word and word segments, suggesting that preserving word boundaries is a property that helps the network.
It appears that the size of a segment is also a very important parameter. Indeed, phone segments naturally preserve word boundaries but of course lack the internal cohesion of a morpheme or a word, as nothing links two adjacent phonemes together. Thus, it seems that segments that preserve meaning (such as words) or from which meaning can be more easily recomposed (syllables) may facilitate the network's task. The fact that syllable-like segments perform as well as word segments might only be an artefact of using English, where a high proportion of words are monosyllabic.12 Working on a language such as Japanese, where the syllable-to-morpheme ratio is much higher, is a future line of work that would enable us to test this hypothesis.

GRU PACK. Layer Position
We introduced boundary information at different levels of our architecture in order to better understand at which layer it is the most useful. We will focus in this section on the results obtained in the KEEP condition, as the ALL condition brings little improvement over the BASELINE condition. Figure 2, depicting the results obtained for Flickr8k in the RANDOM-KEEP and TRUE-KEEP conditions, clearly shows that introducing boundary information at different layers has a clear impact on the results: using such information at the first or the fifth layer is useless, as it either yields similar results to the baseline (GRU PACK.-1) or significantly worsens the results regardless of the type of boundary used (GRU PACK.-5). When using syllable-word segments, the best results are obtained when introducing the information at GRU PACK.-3. These results are exactly in line with those of Chrupała et al. (2017), who found that the intermediate representations of the third and fourth layers are the most informative in predicting the presence or absence of a word. This confirms that the middle layers of our architecture deal with lexical units whereas the fifth layer encodes information that disregards this type of information. All in all, word-like segments seem to be the most robust representation, as they yield the best results at three different layers (GRU PACK.-2,3,4).

12 Jespersen (1929) estimates that at least 8,000 commonly frequent English words are monosyllabic.

Segmentation as a means for compression
Recall that in the KEEP condition, only the last vector of a segment is kept while the other vectors are discarded. This can be interpreted as a form of "guided" subsampling, as subsampling usually does not take linguistic factors into consideration. In order to understand how much information is kept between the input and the output of a GRU PACK. layer, we compute an average compression rate (in %) for each of the segment types for Flickr8k and COCO. The results are the following: phones = 90.57% (Flickr8k) and 89.87% (COCO), syllables-connected = 93.41% and 92.86%, syllables-word = 94.36% and 93.86%, and words = 94.90% and 94.50%. When we reanalyse our results in light of this information, it appears we can remove a large part of the original input (up to 94.90% if using word segments) while conserving or increasing the original R@1. A comparison between Figure 2a and Figure 2b shows that it is not simply the effect of subsampling that helps, but subsampling with meaningful linguistic units. The effect of informed subsampling is striking when we compare R@1 for RANDOM, which is always below the BASELINE, while TRUE is on par with the BASELINE or better. This effect is particularly visible for GRU PACK.-1 in the RANDOM condition, where the more vectors are discarded (syllable-like and word segments), the worse the results. This can only be explained by the fact that random subsampling removes important information that the network is unable to recover in the four subsequent layers.

Figure 2: Maximum R@1 obtained on the Flickr8k data set in the RANDOM-KEEP condition (2a) and in the TRUE-KEEP condition (2b), grouped by GRU PACK. layer. "*" indicates a statistically significant difference (two-sided two-proportion Z-test, p-value < 1e−2) between the R@1 of the baseline and each bar (absence of "*" shows lack of statistically significant difference). Note how using random boundaries results in significantly worse R@1.
A counter-intuitive finding of our experiments is that it is better to subsample early (in the first layers), and thus remove most of the information early on, than later. This is visible in Figure 2b: subsampling with word segments at GRU PACK.-2 (and thus keeping only 5.1% of the original amount of information for the subsequent layers) yields better results than subsampling with the same resolution at GRU PACK.-5.
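The compression rate follows directly from the boundary vector, since a KEEP-condition layer forwards exactly one vector per segment. A minimal sketch (name ours):

```python
import numpy as np

def compression_rate(B):
    """Percentage of input frames discarded by a KEEP-condition
    GRU PACK. layer: only frames with B[t] == 1 are forwarded."""
    B = np.asarray(B)
    return 100.0 * (1.0 - B.sum() / B.size)
```

For instance, a 20-frame caption with 3 segment-final frames yields a compression rate of 85%: only 3 of the 20 frames reach the next layer.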

Towards Hierarchical Segmentation
In our current approach, only one of the five recurrent layers is a GRU PACK. layer, which handles only one type of segment. However, we can stack as many GRU PACK. layers as desired, provided they are supplied with boundary information. Stacking such layers enables us not only to integrate boundary information but also to introduce structure, where one layer handles one type of segment (e.g. phones) and the following GRU PACK. layer handles another type of segment that is hierarchically above the preceding one (e.g. syllables, or words).13 Harwath et al. (2019) explored such a hierarchical architecture using a CNN-based model that incorporated vector quantisation layers and found that it improves R@k. Our work thus attempts to verify whether this is also the case for an RNN-based model.
In this work we explore hierarchical segmentation with phones and words on the Flickr8k data set, and leave syllables for future work. We only consider the KEEP condition, as it yields better results than the ALL condition. We vary the position of the GRU PACK. layers and test all positions where two GRU PACK. layers follow one another. For each configuration, the lowest GRU PACK. layer receives phone boundary information, and the next layer receives word boundary information. Note that such a configuration results in a double sequence reduction. Indeed, after the first GRU PACK. layer, there are only as many output vectors as there are phones, and in the second, the resulting phone vectors are recomposed to form words, resulting in as many output vectors as there are words. Results are shown in Table 4.
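Stacking two KEEP layers requires projecting the word boundaries onto the shorter phone-level output sequence. A minimal sketch of that re-indexing (name ours; assumes word boundaries coincide with a subset of the phone boundaries, as produced by the forced alignment):

```python
import numpy as np

def project_boundaries(B_low, B_high):
    """After a KEEP layer driven by B_low, output position j corresponds
    to the j-th frame where B_low == 1. A kept position is a boundary
    for the next layer iff B_high is also 1 at that frame."""
    kept = [t for t in range(len(B_low)) if B_low[t] == 1]
    return np.array([B_high[t] for t in kept])
```

For a caption with three phones and two words, the second GRU PACK. layer then runs over the three phone vectors with the projected word boundaries, recomposing them into two word vectors.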
Training an architecture with two GRU PACK. layers, each handling a different type of segment, results in much better R@1 than the baseline (+4.2%) and than a single-GRU PACK.-layered architecture (+2.9%), thus showing that introducing hierarchy is beneficial. The results also confirm that layers 2 and 3 of our architecture are those that benefit the most from added linguistic information, and that the upper layers (such as the fifth) do not take as much advantage of this information as the lower layers.

Table 4: Left: Maximum R@1 (in %) on a speech-image retrieval task on Flickr8k using two GRU PACK. layers. Right: Summary of the results shown in Table 3 for phones and words, for comparison purposes.

Discussion
The goal of our experiments was to see whether segmenting speech into sub-units is beneficial, and if so, which units maximise performance. It is indeed the case that segmenting speech into sub-units helps.
As to which segment type obtains the best performance, we observe mixed results. Word segmentation yields better results than phone segmentation, but syllable-like segmentation also gives results in the same ballpark as word segmentation. However, word segmentation seems to be a more robust representation than syllables, as word segments consistently yield better results at various levels of our architecture. Another finding of our experiments which we believe is important is that one cannot subsample speech without taking its linguistic nature into account. Indeed, random subsampling might yield better results than the baseline, but we showed this is only a regularisation effect. Linguistically informed subsampling yields much better results and should be favoured.
Regarding the question of why textual approaches perform better than spoken approaches, we come to the conclusion that the fact that tokens stand for full semantic units plays little part in their performance. The fact that text-based models use segmented input (either tokens or characters) also seems to play little part in the final performance; otherwise we should have observed better results, as our input was also segmented. What seems most crucial is that the representation of a token never changes, whereas speech exhibits a lot of variation, as no word is pronounced in exactly the same fashion each time it is uttered. Our approach helped the network build more consistent representations for the same word (especially in the KEEP condition, see Figure 3), even though it did not succeed for every word. Consistent representations across occurrences seem to be the most important factor.
Our GRU pack setting also allowed us to introduce hierarchy into a neural network by simply stacking GRU pack layers and providing different boundary information to each of them. Our experiments confirm the results obtained by Harwath et al. (2019) on a CNN-based VGS model, namely that introducing hierarchical structures proves beneficial overall.
In this work, we only explored the phone+word hierarchy, but other types of hierarchical nesting could also be imagined. We could explore whether there is any additional benefit in integrating another type of acoustic unit, such as syllables, resulting in a deeper hierarchy (phones+syllables+words). From a syntactic point of view, we could also integrate chunk boundaries and measure the impact of syntactically grouping spoken units. GRU pack layers are thus flexible enough to model a wide range of linguistic units.
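The mechanism discussed above (resetting the recurrent history at segment boundaries, and stacking such layers with different boundary types to build a hierarchy) can be sketched as follows. This is a minimal NumPy illustration under our own naming assumptions (BoundaryGRU, run, the ALL/KEEP flags), not the paper's actual implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BoundaryGRU:
    """Minimal single-example GRU cell whose history is manually reset
    to h_0 at segment boundaries (a sketch of the GRU pack idea)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        d = input_dim + hidden_dim
        # Update gate, reset gate, and candidate-state weights.
        self.Wz = rng.normal(0.0, 0.1, (hidden_dim, d))
        self.Wr = rng.normal(0.0, 0.1, (hidden_dim, d))
        self.Wh = rng.normal(0.0, 0.1, (hidden_dim, d))
        self.h0 = np.zeros(hidden_dim)

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                              # update gate
        r = sigmoid(self.Wr @ xh)                              # reset gate
        h_cand = np.tanh(self.Wh @ np.concatenate([x, r * h])) # candidate
        return (1.0 - z) * h + z * h_cand

    def run(self, xs, boundaries, mode="ALL"):
        """boundaries[t] is True when timestep t ends a segment.
        ALL: every timestep is passed to the next layer;
        KEEP: only segment-final vectors are passed on."""
        h, outputs = self.h0, []
        for x, is_end in zip(xs, boundaries):
            h = self.step(x, h)
            if mode == "ALL" or is_end:
                outputs.append(h)
            if is_end:
                h = self.h0  # materialise the boundary: reset the history
        return outputs
```

A phone+word hierarchy is then obtained by feeding the KEEP output of a phone-boundary layer into a second layer that receives word boundaries defined over the shortened sequence:

```python
rng = np.random.default_rng(1)
xs = [rng.normal(size=13) for _ in range(6)]       # 6 acoustic frames
phone_b = [False, True, False, True, False, True]  # 3 phone segments
word_b = [True, False, True]                       # 2 words over 3 phones
layer1 = BoundaryGRU(13, 32)
layer2 = BoundaryGRU(32, 32)
phones = layer1.run(xs, phone_b, mode="KEEP")      # 3 phone-final vectors
words = layer2.run(phones, word_b, mode="KEEP")    # 2 word-final vectors
```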

Future Work
All the future lines of work we envisage involve learning where the boundaries are located, instead of supplying the network with boundary information at training and testing time.
A first approach would be supervised: we could train an additional classifier that predicts whether the vector computed at a given time step constitutes a boundary. Such an approach would allow segment boundaries to be provided at training time only, and not at testing time.
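Such a classifier could be as simple as a logistic-regression probe over the recurrent layer's hidden vectors. The sketch below is a hypothetical illustration of this idea (the function name and training scheme are our own assumptions, not part of the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_boundary_probe(hidden, labels, lr=0.5, epochs=200, seed=0):
    """Train a logistic-regression probe that predicts whether a
    timestep's hidden vector marks a segment boundary.

    hidden: (n_timesteps, hidden_dim) array of recurrent states.
    labels: (n_timesteps,) array of 0/1 boundary indicators.
    Returns the learnt weights and bias.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(0.0, 0.01, hidden.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(hidden @ w + b)        # boundary probabilities
        grad = p - labels                  # gradient of the log loss
        w -= lr * hidden.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b
```

At test time, a timestep would be treated as a boundary whenever `sigmoid(h @ w + b) > 0.5`, so gold boundaries would only be needed to train the probe.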
We could also imagine a fully unsupervised solution: using ACT-like recurrent cells (Kreutzer and Sokolov, 2018) or an architecture such as that of Chen et al. (2019) to dynamically learn, without supervision, how to segment the input signal into sub-units. Such a method would tell us what the best way of segmenting speech is in order to solve our main task. The additional advantage of such methods is that they make no presupposition about the form or size of the segments, and thus about what a good segment should or should not be, but let the network find the optimal solution.

Conclusion
In this paper, we introduced a simple way to integrate boundary information into a recurrent neural network by manually resetting its history. We showed that the type of boundary used (and hence the resulting segment) has a significant impact on the results: segments that preserve word boundaries and that are long enough yield the best results. Our experiments also reveal that introducing boundary information at different levels of the architecture can greatly affect the results, which in turn reveals what type of linguistic information is handled by the network at specific layers. We showed that linguistically informed subsampling yields better results than subsampling that does not account for the linguistic nature of speech. Finally, we demonstrated that introducing structure through hierarchical linguistic units helps learn a better speech-to-image mapping.

Figure 1: Graphical representation of the different GRUs used in our experiments. Figure 1a shows a vanilla GRU. Figure 1b shows GRU pack in the ALL condition, where all the vectors produced at each time step are passed on to the next layer. Figure 1c shows GRU pack in the KEEP condition, where only the last vector of a segment is passed on to the next layer, resulting in an output sequence shorter than the input sequence. Red crosses inscribed in a square signal that the output vector computed at a given timestep is not passed on to the next timestep and that the initial state h_0 is passed on instead. Red crosses inscribed in a circle signal that the output vector computed at a given timestep is not passed on to the next layer. Dotted lines group vectors belonging to the same segment (either phone, syllable-connected, syllable-word, or word). Note that h_0 is only passed on to the next state at the end of a segment, thus effectively materialising a boundary by manually resetting the history. Also note that x_1, x_2, ..., x_t in this representation could either be the original input sequence (in our case, acoustic vectors) or the output of the previous recurrent layer.
Figure 3: t-SNE projections of the final vector of different occurrences of eight randomly selected words (Flickr8k) in the BASELINE condition (3a), the ALL condition (3b), and the KEEP condition (3c). Plot 3a shows that the representations learnt in the BASELINE condition are not word-based, as the final vectors of different occurrences of the same word do not cluster together. In the ALL condition, the model overall fails to learn similar representations for different occurrences of the same word (except for one word: "man"), despite being supplied with word boundaries. In the KEEP condition, the model succeeds in learning similar representations for different occurrences of the same word, as words cluster together.

Table 1: Mean recalls at 1, 5, and 10 (in %) on a speech-image retrieval task on COCO and Flickr8k in the BASELINE condition. Chance scores are 0.0002/0.001/0.002 for COCO and 0.001/0.005/0.01 for Flickr8k.

Table 2: BASELINE results are shown in Table 1; results for the TRUE/RANDOM conditions obtained on the COCO and Flickr8k data sets are shown in this table. The best result of the table is shown in red. Best results between each TRUE and RANDOM pair (columnwise) are shown in bold. + and - indicate that results are statistically better (respectively worse) than the baseline. Results in italics show statistical significance (two-sided Z-test, p-value < 1e-2, see §4.3) between each TRUE and RANDOM pair (columnwise).
Table 3: Maximum R@1 (in %) for each model trained on the Flickr8k data set. The same naming conventions as in Table 2 are used for this table. A graphical representation of the results in the RANDOM-KEEP and TRUE-KEEP conditions is shown in Figure 2.