Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze

When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled sequentially. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural, particularly when gaze is encoded with a dedicated recurrent component.


Introduction
Describing an image requires the coordination of different modalities. There is a long tradition of cognitive studies showing that the interplay between language and vision is complex. On the one hand, eye movements are influenced by the task at hand, such as locating objects or verbally describing an image (Buswell, 1935; Yarbus, 1967). On the other hand, visual information processing plays a role in guiding linguistic production (e.g., Griffin, 2004; Gleitman et al., 2007). Such cross-modal coordination unfolds sequentially in the specific task of image description (Coco and Keller, 2012), i.e., objects tend to be looked at before being mentioned. Yet, the temporal alignment between the two modalities is not straightforward (Griffin and Bock, 2000; Vaidyanathan et al., 2015).

Figure 1: In our approach, an image captioning model is fed with a sequence of masked images encoding the gaze fixations of a single human speaker during language production. This diagram is a toy illustration.
In this paper, we follow up on these findings and investigate cross-modal alignment in image description by modelling the description generation process computationally. We take a state-of-the-art system for automatic image captioning (Anderson et al., 2018) and develop several model variants that exploit information derived from eye-tracking data. To train these models, we use a relatively small dataset of image descriptions in Dutch (DIDEC; van Miltenburg et al., 2018) that includes information on gaze patterns collected during language production. We hypothesise that a system that encodes gaze data as a proxy for human visual attention will lead to better, more human-like descriptions. In particular, we propose that training such a system with eye-movements sequentially aligned with utterances (see Figure 1) will produce descriptions that reflect the complex coordination across modalities observed in cognitive studies.
We develop a novel metric that measures the level of semantic and sequential alignment between descriptions and use it in two ways. First, we analyse cross-modal coordination in the DIDEC data, finding that the product of content and sequentiality better captures cross-modal correlations than content alone. Second, we test whether our models generate captions that capture sequential alignment. Our experiments show that exploiting gaze-driven attention helps enhance image caption generation, and that processing gaze patterns sequentially results in descriptions that are better aligned with those produced by speakers, as well as being more diverse-both in terms of variability per image and overall vocabulary-particularly when gaze is encoded with a dedicated recurrent component that can better capture the complexity of the temporal alignment across modalities. Our data and code are publicly available at https://github.com/dmg-illc/didec-seq-gen.
Overall, this work presents the first computational model of image description generation where both visual and linguistic processing are modelled sequentially, and lends further support to cognitive theories of sequential cross-modal coordination.

Related Work
Image captioning Various models have been proposed to tackle the challenging task of generating a caption for a visual scene (Bernardi et al., 2016). Contemporary approaches make use of deep neural networks and encoder-decoder architectures (Sutskever et al., 2014). In the influential model by Vinyals et al. (2015), a Convolutional Neural Network (CNN) is used to encode the input image into a feature representation, which is then decoded by a Long Short-Term Memory network (LSTM; Hochreiter and Schmidhuber, 1997) that acts as a generative language model. In recent years, there have been many proposals to enhance this basic architecture. For instance, by extracting features from a lower layer of a CNN, Xu et al. (2015) obtain representations for multiple regions of an image over which attention can be applied by the LSTM decoder. The 'Bottom-up and Top-down Attention' model by Anderson et al. (2018) further refines this idea by extracting multiple image features with the help of Faster R-CNN (Ren et al., 2015), which results in the ability to focus on regions of different sizes better aligned with the objects in the image. Other models based on unsupervised methods (e.g., Feng et al., 2019) and Generative Adversarial Networks (Chen et al., 2019) have also been proposed recently.
We take as our starting point the model by Anderson et al. (2018) for two main reasons: first, it is among the best-performing architectures on standard image captioning benchmarks; second, its underlying idea (i.e., bottom-up and top-down attention) is explicitly inspired by human visual attention mechanisms (Buschman and Miller, 2007), which makes it suitable for investigating the impact of adding human gaze information.
Eye tracking In computer vision, human eye movements collected with eye-tracking methods have been exploited to model what is salient in an image or video for object detection (Papadopoulos et al., 2014), image classification (Karessli et al., 2017), image segmentation (Staudte et al., 2014), region labelling (Vaidyanathan et al., 2015, 2018), and action detection (Vasudevan et al., 2018). More relevant for the present study, gaze has also been used in automatic description generation tasks, such as video frame captioning (Yu et al., 2017) and image captioning (Sugano and Bulling, 2016; Chen and Zhao, 2018; He et al., 2019). In all these approaches, gaze data from different participants is aggregated into a static saliency map to represent an abstract notion of saliency. This aggregated gaze data is used as supervision to train models that predict generic visual saliency.
In contrast, in our approach, we model the production process of a single speaker by directly inputting information about where that speaker looks during description production, and compare this to the aggregation approach. In addition, we exploit the sequential nature of gaze patterns, i.e., the so-called scanpath, and contrast this with the use of static saliency maps. Gaze scanpaths have been used in NLP for diverse purposes: for example, to aid part-of-speech tagging (Barrett et al., 2016) and chunking (Klerke and Plank, 2019); to act as a regulariser in sequence classification tasks (Barrett et al., 2018); as well as for automatic word acquisition (Qu and Chai, 2008) and reference resolution (Kennington et al., 2015). To our knowledge, the present study is the first attempt to investigate sequential gaze information for the specific task of image description generation.

Data
We utilise the Dutch Image Description and Eye-Tracking Corpus (DIDEC; van Miltenburg et al., 2018). In particular, we use the data collected as part of the description-view task in DIDEC, where participants utter a spoken description in Dutch for each image they look at. The gaze of the participants is recorded with an SMI RED 250 eye-tracking device while they describe an image. Overall, DIDEC consists of 4604 descriptions in Dutch (15 descriptions per image on average) for 307 MS COCO images (Lin et al., 2014). For each description, the audio, textual transcription, and the corresponding eye-tracking data are provided.

Preprocessing
We tokenise the raw captions, lowercase them, and exclude punctuation marks and information tokens indicating, e.g., repetitions (<rep>). We then use CMUSphinx 1 to obtain the time intervals of each word given an audio file and its transcription. See Appendix A for more details.
Gaze data in DIDEC is classified into gaze events such as fixations, saccades or blinks. We discard saccades and blinks (since there is no visual input during these events) and use only fixations that fall within the actual image. We treat consecutive occurrences of such fixations as belonging to the same fixation window.

Saliency maps
Using the extracted fixation windows, we create two types of saliency maps, aggregated and sequential, which indicate the prominence of certain image regions as signalled by human gaze.
Aggregated saliency maps (per image) The aggregated saliency map of an image is computed as the combination of all participants' gazes and represents what is generally prominent given the image description task. To create it, we first compute the saliency map of each participant who looked at the given image. Following Coco and Keller (2015), for each fixation window of the participant, we create a Gaussian mask centered at the window's centroid with a standard deviation of 1° of visual angle. Given the data collection setup of DIDEC, this standard deviation corresponds to 44 pixels. We sum up the masks weighted by relative fixation durations and normalise the resulting mask to have values in the range [0, 1]. Finally, we sum up and normalise the maps of all relevant participants to obtain the aggregated saliency map per image.
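As an illustration of this procedure, the sketch below builds per-participant and aggregated saliency maps with NumPy. Function names and the (centroid, duration) fixation format are our own; the 44-pixel standard deviation follows from the 1° of visual angle reported for the DIDEC setup.

```python
import numpy as np

def gaussian_mask(h, w, cy, cx, sigma=44.0):
    """2D Gaussian centred at a fixation centroid (sigma in pixels)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def participant_saliency(h, w, fixations, sigma=44.0):
    """fixations: list of (cy, cx, duration_ms) tuples.
    Masks are weighted by relative fixation duration, then the
    resulting map is normalised to [0, 1]."""
    total = sum(d for _, _, d in fixations)
    sal = np.zeros((h, w))
    for cy, cx, dur in fixations:
        sal += (dur / total) * gaussian_mask(h, w, cy, cx, sigma)
    return sal / sal.max()

def aggregated_saliency(maps):
    """Sum the per-participant maps and normalise again."""
    agg = np.sum(maps, axis=0)
    return agg / agg.max()
```

Normalising by the maximum keeps the most-fixated location at full intensity, which matters for the masking step described below.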
Sequential saliency maps (per image-participant pair) A sequential saliency map consists of a sequence of saliency maps aligned with the words in a description, and represents the scan pattern of a given participant over the course of description production. Using the temporal intervals extracted from the audio files, we align each word with the image regions fixated by the participant right before the word was uttered. For each word w_t, using the same method described above for aggregated maps, we combine all the fixation windows that took place between w_{t-1} and the onset of w_t and normalise them to obtain a word-level saliency map. 2 This way, we obtain a sequence of saliency maps per participant description.
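The word-level grouping of fixation windows could be sketched as follows. This is a simplified illustration with invented names: each fixation window is binned by its start time between consecutive word onsets, and boundary cases such as windows overlapping a word onset are ignored here.

```python
def fixations_for_words(word_onsets, fixation_windows):
    """Group fixation windows by the word they precede.

    word_onsets: list of word onset times in ms, one per word.
    fixation_windows: list of (start_ms, end_ms, centroid) tuples.
    Returns one list of windows per word: for word w_t, all windows
    starting between the onset of w_{t-1} and the onset of w_t
    (the first word collects everything before its own onset).
    """
    groups = []
    prev = 0.0
    for onset in word_onsets:
        group = [f for f in fixation_windows if prev <= f[0] < onset]
        groups.append(group)
        prev = onset
    return groups
```

Each group would then be turned into a word-level saliency map with the Gaussian-mask procedure described above.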

Masked images and image features
The saliency maps are used to keep visible only the image regions that were highly attended by participants and to mask the image areas that were never or rarely looked at (see Figure 1). We create each masked image by calculating the elementwise multiplication between the corresponding 2D saliency map and each RGB channel in the original image. We then extract image features from the masked images using ResNet-101 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009). We take the output of the 2048-d average pooling layer as the image features to give as input to our models.
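A minimal sketch of the masking step (the subsequent ResNet-101 feature extraction is not shown; array shapes are assumptions):

```python
import numpy as np

def mask_image(image, saliency):
    """Element-wise multiply each RGB channel by the 2D saliency map,
    keeping attended regions visible and darkening regions that were
    rarely or never looked at.

    image: (H, W, 3) uint8 array; saliency: (H, W) floats in [0, 1].
    """
    masked = image.astype(np.float32) * saliency[..., None]
    return masked.astype(np.uint8)
```

The masked image would then be passed through the pre-trained CNN, taking the 2048-d average-pooling output as the feature vector.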

Evaluation Measures
We propose a novel metric to quantify the degree of both semantic and sequential alignment between two sentences. In our study, this metric will be leveraged in two ways: (1) to analyse cross-modal coordination in the DIDEC data (Section 5) and (2) to evaluate our generation models (Section 7). For context, we first briefly review several existing metrics for automatic image captioning.
Image Captioning metrics Image caption generation is evaluated by assessing some kind of similarity between the generated caption and one or more reference captions (i.e., those written by human annotators). One of the most commonly used metrics is CIDEr (Vedantam et al., 2015), which (a) computes the overlapping n-grams between the generated caption and the entire set of reference sentences for a given image, and (b) downweighs n-grams that are frequent in the entire corpus via tf-idf scores. Thus-regarding semantics and sequentiality-CIDEr scores can be affected by word order permutations, but not by the relative position of words in the entire caption nor by the presence of different but semantically similar words. Other metrics such as BLEU (which looks at n-gram precision; Papineni et al., 2002) and ROUGE-L (which considers n-gram recall; Lin, 2004) suffer from comparable limitations.
METEOR (Banerjee and Lavie, 2005) and SPICE (Anderson et al., 2016) also make use of n-grams (or tuples in a scene graph, in the case of SPICE) and take into account semantic similarity by matching synonyms using WordNet (Pedersen et al., 2004). This allows for some flexibility, but can be too restrictive to grasp overall semantic similarity. To address this, Kilickaya et al. (2017) proposed using WMD, which builds on word2vec embeddings (Mikolov et al., 2013); more recently, several metrics capitalising on contextual embeddings (Devlin et al., 2019) were proposed, such as BERTScore (Zhang et al., 2020) and MoverScore (Zhao et al., 2019). However, these metrics neglect the sequential alignment of sentences. 3

SSD We propose Semantic and Sequential Distance (SSD), a metric which takes into account both semantic similarity and the overall relative order of words. Regarding the latter, SSD is related to Ordering-based Sequence Similarity (OSS; Gómez-Alonso and Valls, 2008), a measure used by Coco and Keller (2010) to compare sequences of categories representing gaze patterns. 4 Given two sequences of words, i.e., a generated sentence G and a reference sentence R, SSD provides a single positive value representing the overall dissimilarity between G and R: the closer the value to 0, the higher the similarity between the two sentences (note that the value is unbounded). This single value is the average of two terms, gr and rg, which quantify the overall distance between G and R-the sum of cosine (cos) and positional (pos) distances-from G to R and from R to G, respectively. The equation for gr is:

gr = Σ_{i=1}^{|G|} [ cos(G_i, R_s(i)) + pos(G_i, R_s(i)) ]    (1)

where R_s(i) is the semantically closest element to G_i in R, pos(G_i, R_s(i)) = |i − s(i)| / max(|G|, |R|), and cos in our experiments is computed over word2vec embeddings trained on the 4B-token corpus in Dutch, COW (Tulkens et al., 2016).

Figure 2: Computation of gr (Eq. 1). Sums below each word in G stand for cos + pos; darker shades of orange indicate higher cos distance. The value of gr is the sum of the numbers in red (here 3.76). Best viewed in color.

Figure 2 illustrates how the metric works in practice; full details are in Appendix B. For simplicity, the diagram only shows the computation in the gr direction. For example, consider the second element in G, 'lovely'. Its closest embedding in R is 'nice' (cos = 0.33). For each of these elements, we retrieve their position index (i.e., 2 for 'lovely' in G and 6 for 'nice' in R), compute their positional distance, and normalise it by the length of the longest sentence in the pair (here R), obtaining |2 − 6|/9 ≈ 0.44. We then sum the cosine distance and the positional distance to obtain a score for 'lovely': 0.33 + 0.44 = 0.77. To obtain the overall gr value, we add up the scores for all words in G. We compute rg in a similar manner and obtain SSD as follows: SSD = (gr + rg)/2.
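For concreteness, SSD could be implemented along these lines. This is a minimal sketch with an arbitrary embedding lookup in place of the Dutch word2vec vectors; ties in the closest-word search are broken by taking the first match.

```python
import numpy as np

def ssd(g_words, r_words, emb):
    """Semantic and Sequential Distance between generated sentence G and
    reference R. emb maps a word to a vector. Lower is better; 0 means
    identical words in identical relative positions."""
    def cos_dist(u, v):
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # positional distances are normalised by the longer sentence
    norm = max(len(g_words), len(r_words))

    def direction(src, tgt):
        total = 0.0
        for i, w in enumerate(src):
            # find the semantically closest element in the other sentence
            dists = [cos_dist(emb[w], emb[t]) for t in tgt]
            j = int(np.argmin(dists))
            pos = abs(i - j) / norm          # positional distance
            total += dists[j] + pos          # cos + pos, per word
        return total

    return (direction(g_words, r_words) + direction(r_words, g_words)) / 2
```

Two identical sentences score exactly 0, and any reordering or lexical substitution increases the distance.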

Cross-Modal Coordination Analysis
To empirically motivate our generation models, as a preliminary experiment we investigate the level of coordination between visual attention and linguistic production in the DIDEC dataset. In particular, we test whether scanpath similarity and sentence similarity are correlated and whether taking into account the sequential nature of the two modalities results in higher cross-modal alignment.
We transform gaze data into time-ordered sequences of object labels, i.e., scanpaths (e.g., S = 'cat', 'person', 'cat', 'table'), using the annotations of object bounding boxes in the MS COCO image dataset. On average, scanpaths have a length of 23.4 object labels. As for captions, we simply take the full sentences and treat them as sequences of words (e.g., C = 'a cute cat cuddled by a boy'); descriptions contain an average of 12.8 tokens.
Order-sensitive analysis (sequential) For each image, we take the set of produced descriptions and compute all pairwise similarities by using SSD (see Section 4). Similarly, we take the corresponding scanpaths and compute all pairwise similarities by using OSS (Gómez-Alonso and Valls, 2008). We then calculate Spearman's rank correlation (one-tailed) between the two similarity lists. This way, we obtain a correlation coefficient and p-value for each of the 307 images in the dataset.
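The per-image correlation step can be sketched as follows. In practice one would use scipy.stats.spearmanr; the hand-rolled rho below computes the Pearson correlation of ranks and ignores tie correction.

```python
import numpy as np
from itertools import combinations

def pairwise(items, dist):
    """All pairwise distances within one image's items (descriptions
    scored with SSD, or scanpaths scored with OSS), in a fixed order."""
    return [dist(a, b) for a, b in combinations(items, 2)]

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of ranks
    (no tie correction; a library implementation handles ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean(); ry -= ry.mean()
    return float(np.dot(rx, ry) / (np.linalg.norm(rx) * np.linalg.norm(ry)))
```

For each image, `pairwise` would be called once with SSD over descriptions and once with OSS over scanpaths, and `spearman` applied to the two resulting lists.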
Bag of Words analysis (BoW) We compare the correlation observed in the order-sensitive analysis with a BoW approach. Here, we represent a sentence as the average of the word2vec embeddings of the words it contains and a scanpath as a term-frequency vector. We then perform the same correlation analysis described above.
Random baseline (random) As a sanity check, using the stricter order-sensitive measures, for each image we re-compute the correlation between the two lists of similarities after randomly shuffling the sentences and corresponding scanpaths per image. We repeat this analysis 3 times.
Results As shown in Table 1, the highest level of alignment is observed in the sequential condition, where a significant positive correlation between scanpath and sentence similarities is found for 81 images out of 307 (26%). In BoW, the level of alignment is weaker: a positive correlation is found for 73 images (24%), with a lower maximum correlation coefficient (0.49, vs. 0.65 in the sequential condition). Substantially weaker results can be seen in the random condition. These outcomes are in line with those obtained by Coco and Keller (2012) in a small dataset of 576 English sentences describing 24 images.
Overall, the results of the analysis indicate that the product of content and sequentiality better captures the coordination across modalities compared to content alone. Yet, the fact that positive correlations are present for only 26% of the images suggests that coordination across modalities is (not surprisingly) more complex than what can be captured by the present pairwise similarity computation, confirming the intricacy of the cross-modal temporal alignment (Griffin and Bock, 2000;Vaidyanathan et al., 2015). We take this aspect into account in our proposed generation models.

Models
The starting point for our models is the one by Anderson et al. (2018). 5 The main aspect that distinguishes this architecture is its use of Faster R-CNN (Ren et al., 2015) as image encoder, which identifies regions of the image that correspond to objects and are therefore more salient; the authors refer to this type of saliency detection as "bottom-up attention". Each object region i is transformed into an image feature vector v_i. The set of region vectors {v_1, ..., v_k} is utilised by two LSTM modules: the first LSTM takes as input the mean-pooled image feature v (i.e., the mean of all salient region vectors) at each time step, concatenated with the two standard elements of a language model, i.e., the previous hidden state and an embedding of the latest generated word. The hidden state of this first LSTM is then used by an attention mechanism to weight the vectors in {v_1, ..., v_k}; the authors refer to this kind of attention as "top-down". Finally, the resulting weighted average feature vector v_t is given as input to the second LSTM module, which generates the caption one word at a time. Note that the set of region vectors {v_1, ..., v_k} and the mean-pooled vector v are constant over the generation of a caption, while the weights over {v_1, ..., v_k}, and hence the weighted average feature vector v_t, change dynamically at each time step since they are influenced by the words generated so far.

We take the original model as our baseline and modify it to integrate visual attention defined by gaze behaviour. In particular, we replace the mean-pooled vector v by a gaze vector g computed from masked images representing fixation patterns, as explained in Section 3. We do not directly modify the set of object regions {v_1, ..., v_k} present in the original model (i.e., bottom-up attention is still present in our proposed models). However, the top-down attention weights learned by the models are influenced by the gaze patterns given as input.

5 The original implementation of this model can be found at https://github.com/peteanderson80/bottom-up-attention. We developed our models building on the PyTorch re-implementation of the model available at https://github.com/poojahira/image-captioning-bottom-up-top-down.
Concretely, we test the following model conditions:

• NO-GAZE: The original model as described above, with exactly the same image feature vectors used by Anderson et al. (2018).
• GAZE-AGG: The mean-pooled vector v in the original model is replaced with a gaze image vector g computed on the image masked by the aggregated gaze saliency map. As explained in Section 3.2, this corresponds to the combination of all participants' fixations per image and hence remains constant over the course of generation.
• GAZE-SEQ: As depicted in Figure 3, we replace v with g_t, a feature vector computed for the image masked by the participant-specific sequential gaze saliency map at time t. Hence, g_t differs at each time step t. Building on the results of the correlation analysis, this sequential condition thus offers a model of the production process of a speaker where visual processing and language production are time-aligned.
• GAZE-2SEQ: Cross-modal coordination processes seem to go beyond simplistic content and temporal alignment (Griffin and Bock, 2000; Vaidyanathan et al., 2015). To allow for more flexibility, we add an extra gaze-dedicated LSTM component (labelled 'Gaze LSTM' in Figure 3), which processes the sequential gaze vector g_t and produces a hidden representation h_t^g. This dynamic hidden representation goes through a linear layer and then replaces v at each time step t.
For the three GAZE models, we also considered versions where v is concatenated with g or g_t, as appropriate, rather than being replaced by the gaze vectors. Since these variants did not yield better results, we do not discuss them further in the paper.
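To make the GAZE-2SEQ variant concrete, here is a toy NumPy sketch of the Gaze LSTM step. Our models use PyTorch; the weight shapes, names, and gate layout below are illustrative, not the actual implementation.

```python
import numpy as np

def lstm_cell(x, h, c, W, U, b):
    """One step of a standard LSTM cell. W: (4H, D), U: (4H, H), b: (4H,).
    In GAZE-2SEQ, x is the gaze feature g_t for the current time step."""
    z = W @ x + U @ h + b              # stacked gate pre-activations
    i, f, o, g = np.split(z, 4)        # input, forget, output, candidate
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    c_new = sig(f) * c + sig(i) * np.tanh(g)
    h_new = sig(o) * np.tanh(c_new)
    return h_new, c_new

def encode_gaze_sequence(gaze_feats, W, U, b, W_out):
    """Run the Gaze LSTM over g_1..g_T and project each hidden state
    h_t^g through a linear layer; the projected vector stands in for
    the mean-pooled image feature v at each decoding step."""
    H = U.shape[1]
    h = np.zeros(H); c = np.zeros(H)
    outs = []
    for g_t in gaze_feats:
        h, c = lstm_cell(g_t, h, c, W, U, b)
        outs.append(W_out @ h)         # linear layer on h_t^g
    return outs
```

The key property illustrated is that the vector fed to the decoder changes at every step, unlike the constant mean-pooled v of the original model.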

Experiments
We experiment with the proposed models using the DIDEC dataset and report results per model type.

Setup
We randomly split the DIDEC dataset at the image level, using 80% of the 307 images for training, 10% for validation, and 10% for testing. Further details are available in Appendix C.
Pre-training Since DIDEC is a relatively small dataset, we pre-train all our models using a translated version of the train/val annotations of the MS COCO 2017 release. We translated all the captions in the training and validation sets of MS COCO from English to Dutch using the Google Cloud Translation API. 6 We exclude all images present in our DIDEC validation and test sets from the training set of the translated MS COCO. We randomly split the original MS COCO validation set into validation and test. The final translated dataset in Dutch used for pre-training includes over 118k images for training, and 2.5k images each for validation and testing, with an average of 5 captions per image.
Manual examination of a subset of translated captions showed that they are of good quality overall. Indeed, pre-training the NO-GAZE model with the translated corpus results in an improvement of about 21 CIDEr points (from 40.81 to 61.50) on the DIDEC validation set. Given that MS COCO consists of written captions while DIDEC includes spoken descriptions, the two datasets can have distinct characteristics. We expect the transfer learning approach to mitigate this by allowing our models to learn the features of spontaneous spoken descriptions during the fine-tuning phase. All results reported below were obtained with pre-training (i.e., by initialising all models with the weights learned by the NO-GAZE model on the translated dataset and then fine-tuning on DIDEC).

Vocabulary and hyperparameters
We use a vocabulary of 21,634 tokens consisting of the union of the entire DIDEC vocabulary and the translated MS COCO training set vocabulary. For all model types, we perform parameter search focusing on the learning rate, batch size, word embedding dimensions and the type of optimiser. The reported results refer to models trained with a learning rate of 0.0001 optimising the Cross-Entropy Loss with the Adam optimiser. The batch size is 64. The image features have 2048 dimensions and the hidden representations have 1024. The generations for the validation set were obtained through beam search with a beam width of 5. Best models were selected via either SSD or CIDEr scores on the validation set, with an early-stopping patience of 50 epochs.
More information regarding reproducibility can be found in Appendix D.

Results
The results obtained with different models are shown in Table 2. We report results on the test set, averaging over 5 runs with different random seeds. These scores are obtained with the best models selected on the validation set with either SSD or CIDEr. 7 For reference, we also include scores for other metrics not used for model selection. This allows us to check whether scores for other metrics remain reasonably good when the models are optimised for a certain metric; however, only scores in the shaded columns allow us to draw conclusions on the relative performance of different model types. 7 We use the library at https://github.com/Maluuba/nlg-eval to obtain corpus-level BLEU and CIDEr scores.
On average, the best GAZE models outperform the NO-GAZE model: 5.81 vs. 5.86 for SSD (lower is better) and 55.74 vs. 52.45 for CIDEr (higher is better). This indicates that eye-tracking data encodes patterns of attention that can contribute to the enhancement of image description generation. Zooming into the different gaze-injected conditions, we find that among the models selected with SSD, the sequential models perform better than the non-sequential ones. This shows that the proposed models succeed (to some extent) in capturing the sequential alignment across modalities, and that such alignment can be exploited for description generation. Interestingly, GAZE-2SEQ is the best-performing gaze model: it has the best average SSD across runs and the best absolute single run (5.70 vs. 5.79 and 5.80 by GAZE-SEQ and GAZE-AGG, respectively). This suggests that the higher flexibility and abstraction provided by the gaze-dedicated LSTM component offers a more adequate model of the intricate ways in which the two modalities are aligned.
As for the CIDEr-selected models, on average the gaze-injected models also perform better than NO-GAZE. The best results are obtained with GAZE-AGG. This is consistent with what CIDEr captures: it takes into account regularities across different descriptions of a given image; therefore, using a saliency map that combines the gaze patterns of several participants leads to higher scores than inputting sequential saliency maps, which model the path of fixations of each speaker independently. This variability seems to have a negative effect on the CIDEr scores of the sequential models, which are lower than those of GAZE-AGG, yet higher than those of NO-GAZE.
It is worth noting that CIDEr and BLEU-4 scores obtained with the SSD-selected models are sensible, which indicates that the generated descriptions do not suffer with respect to the distinct aspects evaluated by other metrics when the models are optimised with SSD. Indeed, the highest CIDEr score obtained among models selected via SSD (GAZE-SEQ: 56.16) is even higher than that obtained by the best CIDEr-selected one (GAZE-AGG: 55.74). However, this is likely due to CIDEr being sensitive to lexical differences between the test set and the validation set used for model selection, which could lead to slightly different patterns.

specificity — NO-G: "een vrouw die in de keuken staat..." (a woman who is standing in the kitchen...); 2SEQ: "een vrouw in een keuken met donuts" (a woman in the kitchen with donuts)
disfluency — NO-G: "een foto van een straat met een aantal vogels" (a photo of a street with a number of birds); 2SEQ: "uh uh uh uh met een aantal vogels" (uh uh uh uh with some birds)
compression — NO-G: "een rode bus en een bus" (a red bus and a bus); 2SEQ: "twee bussen die geparkeerd staan" (two buses that are parked)
repetition — NO-G: "een straat met auto's en auto's" (a street with cars and cars); 2SEQ: "een straat in de stad met auto's en auto's" (a street in the city with cars and cars)

Figure 4: Phenomena that are either particular to gaze models (specificity, disfluency, and compression) or common to all (repetition). Abbreviations NO-G and 2SEQ refer to NO-GAZE and GAZE-2SEQ, respectively.

Analysis
This section presents an analysis of the descriptions generated by the models on the test set (446 descriptions). We focus on one single run per model.

Cross-modal sequential alignment
Given what SSD captures, our results indicate that the captions generated by GAZE-2SEQ are better aligned-in terms of semantic content and order of words-with the human captions than the ones generated by non-sequential models. Arguably, this enhanced alignment is driven by the specific information provided by the scanpath of each speaker. If this information is used effectively by the sequential models, then we should see more variation in their output. By definition, the non-sequential models generate only one single caption per image. Are the sequential models able to exploit the variation stemming from the speaker-specific scanpaths? Indeed, we find that GAZE-2SEQ generates an average of 4.4 different descriptions per image (i.e., 30% of the generated captions per image are unique). Furthermore, we conjecture that tighter coordination between scanpaths and corresponding descriptions should give rise to more variation, since presumably the scanpath has a stronger causal effect on the description in such cases. To test this, we take the 30 images in the test set and divide them into two groups: (A) images for which a significant positive correlation was found in the cross-modal coordination analysis of Section 5; (B) all the others. These groups include, respectively, 10 and 20 images. As hypothesised, we observe a higher percentage of unique captions per image in A (35%) compared to B (27%).
Quantitative analysis We explore whether there are any quantitative differences across models regarding two aspects, i.e., the average length in tokens of the captions, and the size of the vocabulary produced. No striking differences are observed regarding caption length: NO-GAZE produces slightly shorter captions (avg. 7.5) compared to both GAZE-2SEQ (avg. 7.7) and GAZE-AGG (avg. 8.1). The difference, however, is negligible. Indeed, it appears that equipping models with gaze data does not make sentence length substantially closer to the length of reference captions (avg. 12.3 tokens).
In contrast, there are more pronounced differences regarding vocabulary. While GAZE-AGG has a similar vocabulary size (68 unique tokens produced) to NO-GAZE (63), GAZE-2SEQ is found to almost double it, with 109 unique tokens produced. Though this number is still far from the total size of the reference vocabulary (813), this trend suggests that a more diverse and perhaps 'targeted' language is encouraged when specific image regions are identified through gaze-based attention.
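The two diversity measures discussed above (unique captions per image and overall vocabulary size) can be computed with a helper like this; the function name and input format are our own.

```python
def diversity_stats(captions_by_image):
    """captions_by_image: dict mapping image id -> list of generated
    captions (for sequential models, one per participant scanpath).
    Returns (mean fraction of unique captions per image,
             size of the overall vocabulary produced)."""
    unique_ratios = []
    vocab = set()
    for caps in captions_by_image.values():
        unique_ratios.append(len(set(caps)) / len(caps))
        for c in caps:
            vocab.update(c.split())
    return sum(unique_ratios) / len(unique_ratios), len(vocab)
```

Applied to model outputs, this yields figures directly comparable to the 30% uniqueness and 109-token vocabulary reported for GAZE-2SEQ.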
The following qualitative analysis sheds some light on this hypothesis.
Qualitative analysis Manual inspection of all the captions generated by the models reveals interesting qualitative differences. First, captions generated by gaze-injected models are more likely to refer to objects-even when they are small and/or in the background-which are image-specific and thus very relevant for the caption. For example, when describing the leftmost image in Fig. 4, NO-GAZE does not mention the word donuts, which is produced by both GAZE-AGG and GAZE-2SEQ. Second, gaze-injected models produce language that seems to reflect uncertainty present in the visual input. For the second image of Fig. 4, e.g., both GAZE-AGG and GAZE-2SEQ generate disfluencies such as uh (interestingly, several participants' descriptions include similar disfluencies for this same image, which suggests some degree of uncertainty at the visual level); in contrast, in the entire test set no disfluencies are produced by NO-GAZE. Finally, we find that GAZE-2SEQ is able to produce captions that somehow 'compress' a repetitive sequence (e.g., a red bus and a bus) into a shorter one, embedding a number (e.g., two buses that are parked; see third example in Fig. 4). This phenomenon is never observed in the output of other models (crucially, not even in GAZE-SEQ). We thus conjecture that this ability is due to the presence of the gaze-dedicated LSTM, which allows for a more abstract processing of the visual input. However, the presence of gaze data does not fully solve the issue of words being repeated within the same caption, as illustrated by the rightmost example in Fig. 4. Indeed, this weakness is common to all models, including the best performing GAZE-2SEQ.

Conclusions
We tackled the problem of automatically generating an image description from a novel perspective, by modelling the sequential visual processing of a speaker concurrently with language production. Our study shows that better descriptions, i.e., descriptions more aligned with speakers' productions in terms of content and word order, can be obtained by equipping models with human gaze data. Moreover, this trend is more pronounced when gaze data is fed sequentially, in line with cognitive theories of sequential cross-modal alignment (e.g., Coco and Keller, 2012).
Our study was conducted using the Dutch-language dataset DIDEC (van Miltenburg et al., 2018), which posed the additional challenges of dealing with a small amount of data and a low-resource language. We believe, however, that there is value in conducting research with languages other than English. In the future, our approach and new evaluation measure could be applied to larger eye-tracking datasets, such as the English dataset by He et al. (2019). Since different eye-tracking datasets tend to use different gaze encodings and formats, the pre-processing and analysis steps required to apply our method to other resources were beyond the scope of this paper. We leave testing whether the reported pattern of results holds across different languages to future work.
Despite the challenges mentioned above, our experiments show that a state-of-the-art image captioning model can be effectively extended to encode cognitive information present in human gaze behaviour. Comparing different ways of aligning the gaze modality with language production, as we have done in the present work, can shed light on how these processes unfold in human cognition. This type of computational modelling could help, for example, to study the interaction between gaze and the production of filler words and repetitions, which we have not investigated in detail. Taken together, our results open the door to further work in this direction and support the case for computational approaches leveraging cognitive data.

Appendices

A Audio-Caption Alignment
In this appendix, we provide details on the pipeline used to time-align audio and descriptions. After processing a transcribed caption, we insert it as a grammar rule into a Java Speech Grammar Format (JSGF) file to be fed into CMUSphinx. As CMUSphinx supports English by default, we incorporated into the tool the phonetic and language models and the dictionary for Dutch as provided by the developers of CMUSphinx.⁸ Some words in our JSGF files were not in the VoxForge Dutch phonetic dictionary of CMUSphinx, which lists lexical items and their corresponding pronunciations in a format similar to ARPABET, adapted for Dutch.⁹ To overcome this problem, we used eSpeak¹⁰ to obtain the International Phonetic Alphabet (IPA) transcriptions of such out-of-vocabulary words. We collected the set of IPA symbols occurring in the transcriptions of out-of-vocabulary words and the set of ARPABET symbols in the dictionary. Then, a native speaker of Dutch, who is also a linguist, manually produced a mapping from these IPA symbols to ARPABET symbols of Dutch phonemes.¹¹ Given this mapping, we automatically converted out-of-vocabulary tokens into the required format and appended them to the dictionary. A similar approach was also followed for numbers in numeric notation and certain English words. For some audio-caption pairs, the tool could not find an alignment matching the grammar. We turned off noise- and silence-removal and experimented with parameters related to beam decoding in CMUSphinx to allow for a maximal number of complete alignments. However, we had to exclude some captions containing unintelligible words, in particular at the beginning or in the middle of the audio, since such issues disrupt the alignment procedure.

⁸ https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Dutch/
⁹ http://www.speech.cs.cmu.edu/cgi-bin/cmudict
¹⁰ http://espeak.sourceforge.net/
Considering possible inter-participant differences in terms of pronunciation, the quality of audio files, and possible noise in the background of recordings, we assume that the time intervals of the words we obtained after these pre-processing steps are approximate indicators. Although there might be a few cases where the alignment is not quite accurate, we find this way of obtaining utterance timestamps reliable in general. An example audio-caption alignment is shown in Figure 5.
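To make the dictionary-extension step concrete, the sketch below applies a symbol mapping to the IPA transcription of an out-of-vocabulary word and formats a phonetic-dictionary entry. The IPA-to-ARPABET mapping shown is a made-up fragment for illustration only; the actual mapping is the one produced manually by the native speaker and released in our repository.

```python
# Illustrative sketch of the out-of-vocabulary conversion step.
# The mapping below is a hypothetical fragment, NOT the actual
# IPA-to-ARPABET mapping used in the paper.
IPA_TO_ARPABET = {"t": "T", "r": "R", "eɪ": "EY", "n": "N"}

def ipa_to_arpabet(ipa_symbols):
    """Convert a sequence of IPA symbols to ARPABET-style symbols.

    Unknown symbols are passed through unchanged so that they can be
    flagged for manual inspection.
    """
    return [IPA_TO_ARPABET.get(sym, sym) for sym in ipa_symbols]

def dictionary_entry(word, ipa_symbols):
    """Format a word as a phonetic-dictionary line: WORD  PH1 PH2 ..."""
    return f"{word}  {' '.join(ipa_to_arpabet(ipa_symbols))}"
```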

B SSD: Further Details
SSD is the average of two terms, gr and rg, which quantify the overall distance between a generated sentence (G) and a reference sentence (R). Eq. 2 (identical to Eq. 1 in Section 4) shows the calculation from G to R, and Eq. 3 from R to G:

$gr = \frac{1}{N} \sum_{i=1}^{N} \left[ \cos(G_i, R_{s(i)}) + \mathit{pos}(G_i, R_{s(i)}) \right]$ (2)

$rg = \frac{1}{M} \sum_{j=1}^{M} \left[ \cos(R_j, G_{s(j)}) + \mathit{pos}(R_j, G_{s(j)}) \right]$ (3)

N and M refer to the number of tokens in G and R, respectively. Cosine and positional distances are computed between the $i$-th element of G and the most semantically similar word to $G_i$ in R. $R_{s(i)}$ is the word in R most semantically similar to $G_i$, and $G_{s(j)}$ is the word in G most semantically similar to $R_j$.

¹¹ The mapping from IPA symbols to ARPABET symbols is provided in our GitHub repository.

Table 3 shows some example descriptions generated by the GAZE-2SEQ model and corresponding references for a single image. We report the overall SSD scores along with gr and rg values separately.

The number of human descriptions per image varies in DIDEC and, as we also removed some captions during preprocessing, images do not have an equal number of captions. Therefore, we report the average number of captions per image for each split, as well as their range, in Table 5.
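A minimal sketch of the SSD computation is given below, representing each sentence as a list of word-embedding vectors. The exact form of the positional distance pos is an assumption here (the absolute difference of length-normalised token positions); the precise definitions are those in Section 4.

```python
import numpy as np

def cos_dist(u, v):
    # Cosine distance between two embedding vectors.
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def directed_distance(A, B):
    """Average, over the tokens of A, of cosine + positional distance to the
    most semantically similar token in B (Eq. 2 with A=G, B=R)."""
    N, M = len(A), len(B)
    total = 0.0
    for i, a in enumerate(A):
        dists = [cos_dist(a, b) for b in B]
        s = int(np.argmin(dists))   # index of the most similar token in B
        pos = abs(i / N - s / M)    # ASSUMED form of the positional distance
        total += dists[s] + pos
    return total / N

def ssd(G, R):
    """Sentence Semantic Distance: average of the two directed terms."""
    return 0.5 * (directed_distance(G, R) + directed_distance(R, G))
```

Identical sentences yield an SSD of 0; the score grows with semantic and positional mismatch between the two sentences.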

D Reproducibility
We implemented and trained our models in Python version 3.6 and PyTorch version 0.4.1. All models were run on a computer cluster with Debian Linux OS. Each model used a single GeForce 1080Ti GPU (11GB GDDR5X), with NVIDIA driver version 418.56 and CUDA version 10.1. Pre-training with the translated MS COCO dataset took approximately 5 days. NO-GAZE and GAZE-AGG took around 1.5 hours, and GAZE-SEQ and GAZE-2SEQ took 2 hours, to fine-tune over the pre-trained model.
Since the pre-trained model and the fine-tuned NO-GAZE, GAZE-AGG and GAZE-SEQ models use essentially the same architecture, they have an equal number of parameters: 85 million. GAZE-2SEQ has more parameters due to the addition of the Gaze LSTM: 100 million.
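Parameter counts such as the 85M and 100M figures above can be obtained with a one-line sum over `model.parameters()`. The module below is a toy stand-in, not the actual captioning model.

```python
import torch.nn as nn

def count_parameters(model):
    """Total number of trainable parameters in a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy module standing in for the captioning model:
# a Linear(10, 5) has 10*5 weights + 5 biases = 55 parameters.
toy = nn.Linear(10, 5)
```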
In all the models, the biases in linear layers were set to 0 and the weights were uniformly sampled from the range (-0.1, 0.1). Embedding weights were initialised uniformly in the range (-0.1, 0.1). LSTM hidden states were initialised to 0.
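The initialisation scheme above can be sketched as a standard PyTorch init function applied over all submodules; the helper name `init_weights` is illustrative.

```python
import torch
import torch.nn as nn

def init_weights(module):
    """Initialise linear and embedding weights uniformly in (-0.1, 0.1)
    and set linear-layer biases to 0, as described above."""
    if isinstance(module, nn.Linear):
        nn.init.uniform_(module.weight, -0.1, 0.1)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        nn.init.uniform_(module.weight, -0.1, 0.1)

# model.apply(init_weights) walks every submodule; LSTM hidden states
# are simply created with torch.zeros(...) at the start of each sequence.
```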
Below we give details regarding the manuallytuned hyperparameters.

D.1 Hyperparameters for Pre-Training
We experimented with learning rate (0.001, 0.0001), dimensions for the word embeddings and hidden representations (512, 1024) and batch size (64, 128). The best pre-trained model is selected based on its CIDEr score on the validation split of our translated MS COCO dataset, with an earlystopping patience of 20 epochs. We use a learning rate of 0.0001 optimising the Cross-Entropy Loss with the Adam optimiser. The batch size is 128. The image features have 2048 dimensions and the hidden representations 1024. The generations for the validation set are obtained through beam search with a beam width of 5.
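For concreteness, the selected pre-training configuration corresponds to an optimisation setup like the following; `nn.Linear` merely stands in for the full captioning architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of the selected pre-training setup described above:
# Adam with lr = 1e-4, cross-entropy loss, batch size 128,
# 2048-dim image features projected to 1024-dim hidden representations.
model = nn.Linear(2048, 1024)   # stand-in for the captioning model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
batch_size = 128
```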

D.2 Hyperparameters for Fine-tuning
We experimented with the same set of hyperparameters as in pre-training. The hyperparameters of the selected models are given in the main text. We select the models separately based on CIDEr scores and on SSD scores. We train each model type in its selected configuration with 5 different random seeds, which set the random behaviour of PyTorch and NumPy. We also turn off the cuDNN benchmark and set cuDNN to deterministic mode.
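The seeding and cuDNN settings described above correspond to a helper like the following (the function name `set_reproducible` is illustrative):

```python
import random
import numpy as np
import torch

def set_reproducible(seed):
    """Fix the random behaviour of Python, NumPy and PyTorch, and make
    cuDNN deterministic, as described above."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
```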