Interactive Key-Value Memory-augmented Attention for Image Paragraph Captioning

Image paragraph captioning (IPC) aims to generate a fine-grained paragraph to describe the visual content of an image. Significant progress has been made by deep neural networks, in which the attention mechanism plays an essential role. However, conventional attention mechanisms tend to ignore the past alignment information, which often results in problems of repetitive captioning and incomplete captioning. In this paper, we propose an Interactive key-value Memory-augmented Attention model for image Paragraph captioning (IMAP) to keep track of the attention history (salient-object coverage information) along with the update chain of the decoder state, and thereby avoid generating repetitive or incomplete image descriptions. In addition, we employ an adaptive attention mechanism to realize adaptive alignment from image regions to caption words, where an image region can be mapped to an arbitrary number of caption words while a caption word can also attend to an arbitrary number of image regions. Extensive experiments on a benchmark dataset (i.e., Stanford) demonstrate the effectiveness of our IMAP model.


Introduction
Image captioning has received a significant amount of attention in recent years and is applicable in various scenarios such as virtual assistants, image indexing, and support for people with disabilities. Significant progress has been made in generating a single sentence to describe an image (Karpathy and Fei-Fei, 2015; Anderson et al., 2018). However, a single sentence has limited descriptive capacity and fails to recapitulate every detail of an image, which largely undermines applications of image captioning in real-world scenarios. One recent alternative to sentence-level captioning is image paragraph captioning, which aims to generate a coherent and fine-grained paragraph (usually 4-6 sentences) to describe an image.
Inspired by the successful use of the encoder-decoder framework in neural machine translation (NMT) (Bahdanau et al., 2014), most works on image paragraph captioning employ a convolutional neural network (CNN) as an encoder to obtain fixed-length image representations, and then generate image descriptions with a long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) decoder and attention mechanisms. One representative method uses region-based visual attention to produce a topic vector and then employs language attention to generate captions (Liang et al., 2017).
Although conventional attention-based methods have greatly enhanced the performance of image paragraph captioning, there is no mechanism to effectively keep track of the attention history in learning the dynamic alignment between the neural representations of images and that of natural languages. We argue that lacking coverage (history) information might result in two problems in conventional image paragraph captioning: (i) repetitive captioning that some image regions are unnecessarily accessed for multiple times and (ii) incomplete captioning that some image regions are mistakenly unexplored.
Concretely, the attention at each time step shows which image regions the model should focus on to predict the next target word in the paragraph. However, generating a target word heavily depends on the relevant parts of the whole image, and an image region is involved in the generation of the whole paragraph. As a result, repetitive captioning and incomplete captioning inevitably happen because the coverage of image regions of interest is ignored. Figure 1 shows an example from the Stanford dataset. The model that is unaware of the coverage information (i.e., IMAP w/o memory) generates repeated sentences to describe the same region of the image ("there is a large green tree behind the pool"), while the "several umbrellas" and "a large body of water" are unexplored.

[Figure 1: Example paragraph captions generated by IMAP and IMAP w/o memory network. The captions generated by IMAP w/o memory contain repeated phrases (in red) and cannot cover all the salient regions of the images.]
In this paper, we propose an Interactive key-value Memory-augmented Attention model for image Paragraph captioning (IMAP) to alleviate the repetitive captioning and incomplete captioning problems. Our model exploits the recent success of the hierarchical LSTM in generating image paragraph captions (Krause et al., 2017). A sentence LSTM recursively generates sentence topic vectors conditioned on the image features learned by a CNN encoder, and a word RNN is subsequently adopted to decode each topic into an output sentence word by word, with an attention mechanism to learn the context image representation at each decoding step. Different from conventional attention methods, the IMAP model generates the image context representation that is appropriate for predicting the next target word with iterative memory access operations conducted on a key-memory and a value-memory, inspired by memory-augmented attention (Meng et al., 2016). This mechanism allows the model to track the coverage information for each salient object within the image, and therefore avoid repetitive and incomplete captioning. Specifically, we leverage the key-value paired memories to interactively maintain the visual and language features of the input image. The key-memory is continually updated to track the interaction history between the image representation and the decoder through a writing operation, while the value-memory is kept fixed to store the original semantic image features throughout the whole decoding process. In each decoding step, the model learns to address relevant memories based on the key-memory using a "query" state learned from the previous decoder state and the previous prediction, and the corresponding values in the value-memory are subsequently returned as the image context representation.
In addition, we employ an adaptive attention mechanism to realize adaptive alignment from image regions to caption words, where an image region can be mapped to an arbitrary number of caption words while a caption word can also attend to an arbitrary number of image regions. We summarize our main contributions as follows. (1) We propose an interactive key-value memory-augmented attention to better keep track of the attention history and image coverage information, helping the decoder overcome the repetitive and incomplete captioning problems by automatically distinguishing which parts of the image have been described and which parts remain unexplored. (2) We leverage language phrase features with the interactive key-value memory-augmented attention to help the model learn better alignment between visual and language features. (3) Experiments on an image paragraph captioning benchmark demonstrate that the proposed method outperforms previous state-of-the-art approaches by a substantial margin across multiple evaluation metrics.

Related Work
Single Sentence Image Captioning Automatic image captioning involves analyzing the visual content of an input image, and generating a textual sentence that verbalizes its most salient aspects (Xu et al., 2015). Inspired by the success of the encoder-decoder framework in neural machine translation, most recent image captioning methods employ the sequence-to-sequence (seq2seq) model to generate image captions (Xu et al., 2015;Anderson et al., 2018;Rennie et al., 2017;Xu et al., 2019;Huang et al., 2019).
For instance, an attention-based encoder-decoder neural network was proposed in (Xu et al., 2015), which learned to dynamically attend to different locations of the images during decoding different words in the captions. Anderson et al. (2018) proposed a bottom-up and top-down attention mechanism to enable attention to be computed at the level of objects and salient regions. There were also several recent image captioning studies employing reinforcement learning techniques in the encoder-decoder neural networks. For instance, Rennie et al. (2017) presented a self-critical sequence training (SCST) method by considering the optimization of image captioning as a reinforcement learning problem.
Image Paragraph Captioning Since a single sentence has limited descriptive capacity and fails to recapitulate every detail of an image, the task of image paragraph generation, which describes an image with a long, descriptive, and coherent paragraph, has received increasing attention recently. Krause et al. (2017) was one of the early image paragraph captioning studies, which employed a sentence-level recurrent neural network (RNN) to generate sentence topic vectors, and then applied a word-level RNN to decode each topic vector into a sentence. Subsequently, Liang et al. (2017) introduced a recurrent topic-transition generative adversarial network (RTT-GAN) to extend the hierarchical RNN by proposing an adversarial framework between a paragraph generator and two multi-level discriminators (a sentence discriminator and a topic-transition discriminator). Chatterjee and Schwing (2018) augmented the hierarchical RNN by leveraging coherence vectors to ensure cross-sentence topic smoothness and global topic vectors to summarize the overall information of the image. A convolutional auto-encoding model was also proposed for image paragraph captioning, which incorporated a convolutional and deconvolutional autoencoding framework for topic modeling on region-level features of an image.
Different from the aforementioned methods, the IMAP model focuses on alleviating the repetitive and incomplete captioning problems in conventional image paragraph captioning by designing an interactive key-value memory-augmented attention mechanism to track the coverage information typically for each salient object within the image.

Our Methodology
Given an image I, image paragraph captioning aims to generate a long paragraph description Y = {y_1, y_2, ..., y_N}, where N is the number of sentences in the paragraph Y. Each sentence y_i = {w^{y_i}_1, w^{y_i}_2, ..., w^{y_i}_T} consists of T words. As illustrated in Figure 2, the proposed IMAP model consists of three parts: (i) an image encoder, which encodes a query image and outputs a set of visual feature vectors and corresponding dense phrases (language features); (ii) a hierarchical decoder, which leverages a sentence-level LSTM to generate sentence topic vectors, and then employs a word-level RNN to decode each topic vector into a sentence word by word; (iii) an interactive key-value memory-augmented attention module, which is used to keep track of the attention history and encourage the decoder to consider the unexplored salient image regions. Next, we introduce each part of our IMAP model in detail.

Image Encoder
Following previous works (Johnson et al., 2016; Krause et al., 2017), given an image, we use a dense captioning method, i.e., DenseCap (Johnson et al., 2016), as our image encoder to detect a set of semantic regions and produce the corresponding dense phrases describing the regions in natural language. Formally, we use DenseCap to encode the input image I into M semantic feature vectors, denoted as V = {v_1, v_2, ..., v_M}, corresponding to the features extracted at different locations of the image. Taking the learned visual feature vectors V as input, the language model of DenseCap generates a set of dense phrases S = {s_1, s_2, ..., s_M}, where the phrase s_i corresponds to the semantic visual feature v_i. Each short dense phrase s_i is composed of m_{s_i} words, denoted as s_i = {w^{s_i}_1, w^{s_i}_2, ..., w^{s_i}_{m_{s_i}}}.
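To make the encoder's output contract concrete, the following minimal NumPy sketch mocks up the two structures the rest of the model consumes: a matrix of M region feature vectors and one dense phrase per region. The sizes, random features, and phrase strings are illustrative stand-ins, not DenseCap's actual API.

```python
import numpy as np

# Hypothetical stand-in for the encoder's outputs: M region feature vectors
# (one D-dimensional vector per detected region) and one short phrase per region.
M, D = 3, 8  # tiny sizes for illustration; the paper uses M = 50, D = 4096
rng = np.random.default_rng(0)

V = rng.standard_normal((M, D))       # semantic region features v_1..v_M
S = ["a large blue pool",             # dense phrase s_i describing region i
     "people on the sidewalk",
     "trees in the background"]

assert V.shape == (M, D) and len(S) == M  # one phrase per region feature
```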

Hierarchical Decoder
Inspired by the hierarchical LSTM structure in (Krause et al., 2017), we devise a two-level LSTM-based paragraph generator, which is composed of a sentence-level LSTM (Sent-LSTM) for inter-sentence dependency modeling and two word-level LSTMs (Word-LSTMs) for sentence generation conditioned on each distilled topic, together with the interactive key-value memory-augmented attention module, as illustrated in Figure 2. The sentence LSTM takes as input the image features, decides how many sentences to generate in the resulting paragraph, and produces an input topic vector for each sentence. Given this topic vector, the word LSTMs generate the sentence word by word.
In the decoding step, we also propose an adaptive attention strategy and an interactive key-value memory network to alleviate the repetitive captioning and incomplete captioning problems.
Sentence-level LSTM Similar to (Krause et al., 2017), we aggregate the set of semantic feature vectors V into a single pooled vector v̄ which compactly describes the content of the image. Formally, we compute the pooled vector by projecting each region vector and taking an element-wise max-pooling operation:

v̄ = max_i (W v_i + b)   (1)

where W ∈ R^{u×D} (u is the dimension of the pooled visual vector) and b ∈ R^u are learned parameters, and max denotes element-wise max-pooling over the M regions.
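The projection-and-pooling step can be sketched as follows; this is an illustrative NumPy version with randomly initialized W and b, not the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D, u = 50, 4096, 512            # regions, feature size, pooled size (paper's values)
V = rng.standard_normal((M, D))    # region feature vectors v_1..v_M
W = rng.standard_normal((u, D)) * 0.01
b = np.zeros(u)

# Project each region vector and take an element-wise max over regions:
# v_bar = max_i (W v_i + b)
projected = V @ W.T + b            # (M, u): one projected vector per region
v_bar = projected.max(axis=0)      # (u,): pooled image vector

assert v_bar.shape == (u,)
```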
The sentence LSTM decides the number of sentences that should be included in the generated paragraph and produces a u-dimensional topic vector for each of these sentences. It recursively takes the pooled image vector v̄ as input and generates a sequence of hidden states h_1, h_2, ..., h_L, where h_i ∈ R^H (H is the dimension of the hidden state) and L is the maximum number of sentences in a paragraph. Each hidden state h_i is fed into a two-layer fully-connected network to compute the topic vector z_i ∈ R^u for the i-th sentence in the paragraph, which is the input to the word LSTM. In addition, we also use the hidden state h_i to compute a probability distribution p_i over the two states {CONTINUE=1, STOP=0} via a linear layer, where the label "CONTINUE" indicates that the decoder should generate the next sentence, while "STOP" indicates that the current sentence is the last sentence in the paragraph.
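A minimal sketch of the sentence-level loop follows. The recurrence, topic projection, and stop predictor are toy stand-ins (a real implementation would use a trained LSTM and a two-layer fully-connected network); it only illustrates how topic vectors are emitted until STOP is predicted or the maximum of L sentences is reached.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
H, u, L = 512, 512, 6               # hidden size, topic size, max sentences

def sent_rnn_step(v_bar, h):        # toy recurrence standing in for the Sent-LSTM
    return np.tanh(0.5 * h + 0.5 * v_bar[:H])

W_topic = rng.standard_normal((u, H)) * 0.01  # topic projection (two FC layers in the paper)
w_stop = rng.standard_normal(H) * 0.01        # linear layer for the CONTINUE/STOP decision

v_bar = rng.standard_normal(max(H, u))        # pooled image vector from Eq. (1)
h = np.zeros(H)
topics = []
for i in range(L):
    h = sent_rnn_step(v_bar, h)
    topics.append(W_topic @ h)                # topic vector z_i for sentence i
    p_continue = sigmoid(w_stop @ h)          # p_i over {CONTINUE=1, STOP=0}
    if p_continue < 0.5:                      # STOP: current sentence is the last one
        break

assert 1 <= len(topics) <= L and topics[0].shape == (u,)
```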

Word-level LSTMs
The word-level LSTMs form a two-layer LSTM, which is responsible for generating the words of a sentence. When generating the t-th target word w^{y_i}_t of the i-th sentence, we use the hidden state of the second LSTM layer (LSTM^{(2)}) to determine the number of attention steps for the adaptive attention strategy, and derive a weighted average of the hidden states of LSTM^{(2)} to generate the target word. Formally, at the n-th attention time step of decoding time step t, the first word-level LSTM layer (LSTM^{(1)}) takes as input the concatenation of the topic vector z_i, the previous output h^{(2)}_{i,t,n-1} of the second LSTM layer (LSTM^{(2)}), and the previous word embedding e(w^{y_i}_{t-1}) ∈ R^E (E is the dimension of word embedding):

x_{i,t,n} = [z_i; h^{(2)}_{i,t,n-1}; e(w^{y_i}_{t-1})]   (2)

where x_{i,t,n} is the input of LSTM^{(1)} at attention time step n of decoding time step t for the i-th sentence, and w^{y_i}_{t-1} is the generated word at time step t − 1. The hidden state of LSTM^{(1)} at attention time step n of decoding time step t is calculated as:

h^{(1)}_{i,t,n} = LSTM^{(1)}(x_{i,t,n}, h^{(1)}_{i,t,n-1})   (3)

We use another word-level LSTM (LSTM^{(2)}) to produce a caption by generating one word at every time step, which takes as input the concatenation of the output of LSTM^{(1)} and the context image feature c_{i,t,n}. The hidden state of LSTM^{(2)} at attention time step n of decoding time step t is computed by:

h^{(2)}_{i,t,n} = LSTM^{(2)}([h^{(1)}_{i,t,n}; c_{i,t,n}], h^{(2)}_{i,t,n-1})   (4)

Adaptive Attention Strategy In most previous works (Liang et al., 2017; Chatterjee and Schwing, 2018), each target word attends to only one image region, which is not always appropriate in practice. In this work, we employ an adaptive attention strategy that allows each word to attend to several image regions adaptively.
To determine the number of attention steps, a confidence network implemented with a multilayer perceptron (MLP) is applied to output a halting probability a_{i,t,n} at each attention step:

a_{i,t,n} = σ(MLP(h^{(2)}_{i,t,n}))   (5)

The desired number of attention steps O(i, t) at the t-th decoding step for the i-th sentence is calculated by:

O(i, t) = min{ n : Σ_{n'=1}^{n} a_{i,t,n'} ≥ 1 − ε }   (6)

where ε is a parameter to control the total attention steps, which is set to 10^{-4} in this work. After finishing all the attention steps, the weighted average of the hidden states is calculated as:

ĥ_{i,t} = Σ_{n=1}^{O(i,t)} a_{i,t,n} h^{(2)}_{i,t,n}   (7)

where the halting score of the final step is replaced by the remainder 1 − Σ_{n'=1}^{O(i,t)−1} a_{i,t,n'} so that the weights sum to one. In conventional attention-based methods, the context vector c_{i,t,n} is usually calculated as a weighted sum of the whole original image feature vectors V, which ignores the attention history and the coverage information of the salient image regions of interest. To make the decoder keep track of the previous attention history and attend to the proper image regions at each decoding step, we propose an Interactive Key-value Memory-augmented Attention (IKVMA) to read and update the image feature vectors. We describe the implementation details of IKVMA in Section 3.3. The address operation of IKVMA defined in Eq. (11) and Eq. (12) is used to obtain the visual attention weight, and the context vector for the image features, denoted as c^v_{i,t,n}, is then computed by the read operation of IKVMA defined in Eq. (13) over the visual key-value memory based on the attention weight. The dense phrases generated by DenseCap provide complementary information for the model to learn better alignment between visual and language features. Thus, we combine the visual features and dense phrases to form the context vector. The pooled language feature vector s̄_i for the input image is computed as:

s̄_i = (1 / m_{s_i}) Σ_{j=1}^{m_{s_i}} e(w^{s_i}_j)   (8)

where s̄_i ∈ R^E is the language feature for the phrase s_i, m_{s_i} is the number of words in the phrase s_i, and e(w^{s_i}_j) denotes the word embedding of the word w^{s_i}_j. We apply the address operation of IKVMA defined in Eq. (11) and Eq. (12) to compute the language attention weight, and apply the read operation defined in Eq. (13) over a language key-value memory to produce a context vector for the dense phrases, denoted as c^l_{i,t,n}. Note that the visual attention weight is reused to make the decoder attend to the proper dense phrases when computing the context vector c^l_{i,t,n}, via the element-wise product of the visual attention weight and the language attention weight.
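The halting computation above resembles adaptive-computation-time style decoding. The sketch below is one plausible reading of Eq. (5) and the step count O(i, t): attention steps accumulate halting scores until the total reaches 1 − ε or the step budget runs out, and the per-step hidden states are then averaged with those scores, with the last step taking the remainder so the weights sum to one. The MLP is collapsed to a single random vector for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
H, N_max, eps = 512, 4, 1e-4        # hidden size, max attention steps, halting threshold

# Per-step hidden states h^{(2)}_{i,t,n} and halting scores a_{i,t,n} (toy stand-ins).
hiddens = [rng.standard_normal(H) for _ in range(N_max)]
w_halt = rng.standard_normal(H) * 0.01      # stand-in for the confidence MLP
halt = [sigmoid(w_halt @ h) for h in hiddens]

# Run attention steps until the cumulative halting score reaches 1 - eps
# (or the step budget is exhausted); `steps` plays the role of O(i, t).
total, steps = 0.0, 0
weights = []
for a, h in zip(halt, hiddens):
    steps += 1
    if total + a >= 1.0 - eps or steps == N_max:
        weights.append(1.0 - total)         # remainder weight for the final step
        break
    weights.append(a)
    total += a

# Weighted average of the per-step hidden states used to predict the word.
h_hat = sum(w * h for w, h in zip(weights, hiddens[:steps]))

assert len(weights) == steps <= N_max
assert abs(sum(weights) - 1.0) < 1e-9       # weights form a proper average
```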
We concatenate the visual context vector c^v_{i,t,n} and the language context vector c^l_{i,t,n} to form the final memory-augmented context vector c_{i,t,n} = [c^v_{i,t,n}; c^l_{i,t,n}]. Finally, the generation probability of the t-th word w^{y_i}_t for the i-th sentence is computed over the entire vocabulary:

p(w^{y_i}_t) = softmax(U_w ĥ_{i,t} + b_w)   (9)

where U_w ∈ R^{P×H} and b_w ∈ R^P are parameters to be learned, and P is the number of words in the vocabulary.

Interactive Key-Value Memory-augmented Attention
The IKVMA module consists of two components: a timely updated key-memory K ∈ R^{M×d} to keep track of the attention history, and a fixed value-memory A ∈ R^{M×d} to store the image features throughout the whole decoding process. Both the key-memory and the value-memory consist of M slots, and are initialized with the region features {v'_1, v'_2, ..., v'_M} ∈ R^{M×d}, which are obtained by embedding each image feature vector v_j ∈ R^D into a d-dimensional region feature v'_j through a linear layer. At each decoding step, the j-th slot in the key-memory stores the attention status corresponding to the j-th image feature, which is updated along with the decoding process, and the j-th slot in the value-memory stores the representation of the j-th feature vector v'_j.

Key-Memory Addressing At attention time step n of decoding time step t, we compute a query vector from the concatenation of the hidden state h^{(1)}_{i,t,n} and the previous word embedding e(w^{y_i}_{t-1}):

q_{i,t,n} = f_q([h^{(1)}_{i,t,n}; e(w^{y_i}_{t-1})], q_{i,t,n-1})   (11)

where f_q is a recurrent unit, and q_{i,t,n} is the query vector for the i-th sentence at attention time step n of decoding time step t, which is used to address the key-memory. Specifically, we compute the attention vector α_{i,t,n} ∈ R^M over the key-memory K_{i,t,n-1} as:

α_{i,t,n,j} = softmax_j( g([q_{i,t,n}; K_{i,t,n-1,j}]) )   (12)

where g is a two-layer neural network which projects a vector into a scalar value, K_{i,t,n-1,j} represents the j-th slot of the key-memory at attention time step n − 1 of decoding time step t for the i-th sentence, and α_{i,t,n,j} indicates the weight assigned to the j-th memory slot K_{i,t,n-1,j}.
Value-Memory Reading After obtaining the attention weight α_{i,t,n}, the context vector c_{i,t,n} is computed as the weighted sum of all slots in the value-memory A:

c_{i,t,n} = Σ_{j=1}^{M} α_{i,t,n,j} A_j   (13)

where A_j is the j-th slot in the value-memory.

Key-Memory Updating The updating process of the key-memory state includes two operations: ERASE and ADD. The ERASE operation decides the content to be removed from the memory state, which is similar to the forget gate in an LSTM. With the ERASE operation, the model can avoid exploring the same image location multiple times, and therefore alleviates the repetitive captioning problem. Formally, the key-memory state after the erase operation is:

K̃_{i,t,n,j} = K_{i,t,n-1,j} ⊙ (1 − ω_{i,t,n,j} F_{i,t,n})   (14)

where F_{i,t,n} = σ(W_e h^{(2)}_{i,t,n}), W_e ∈ R^{d×H} is a learnable parameter, σ is the sigmoid activation function, and F_{i,t,n} ∈ R^d. ω_{i,t,n,j} indicates the weight of the j-th slot of the memory state, which is computed by:

ω_{i,t,n,j} = softmax_j( g([q_{i,t,n}; K_{i,t,n-1,j}]) )   (15)

where g is defined in Eq. (12). The ADD operation decides how much current (new) information should be added to the key-memory state to track the dynamic interaction between the key-memory and the decoder, which is computed as:

K_{i,t,n,j} = K̃_{i,t,n,j} + ω_{i,t,n,j} F̃_{i,t,n}   (16)

where F̃_{i,t,n} = tanh(W_a h^{(2)}_{i,t,n}), W_a ∈ R^{d×H} is a learnable parameter, and F̃_{i,t,n} ∈ R^d.
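The address, read, erase, and add operations can be put together for one decoding step as in the following NumPy sketch. All parameters are random stand-ins, the scoring network g is reduced to a single linear layer, and the slot weights ω are approximated by reusing the attention weights α (the paper computes them with g as in Eq. (12)); the point is the data flow: only the key-memory changes, while the value-memory stays fixed.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
M, d, H = 5, 16, 16                 # slots, memory width, decoder hidden size (toy sizes)

V = rng.standard_normal((M, d))     # embedded region features v'_1..v'_M
K = V.copy()                        # key-memory: updated to track attention history
A = V.copy()                        # value-memory: fixed image features

w_g = rng.standard_normal(2 * d) * 0.1     # stand-in for the scoring network g
W_e = rng.standard_normal((d, H)) * 0.1    # ERASE projection W_e
W_a = rng.standard_normal((d, H)) * 0.1    # ADD projection W_a

q = rng.standard_normal(d)          # query built from h^{(1)} and the previous word
h = rng.standard_normal(H)          # decoder hidden state h^{(2)}_{i,t,n}

# Addressing (Eqs. 11-12): score each key slot against the query.
alpha = softmax(np.array([w_g @ np.concatenate([q, K[j]]) for j in range(M)]))

# Reading (Eq. 13): context vector as the alpha-weighted sum of value slots.
c = alpha @ A

# Updating (Eqs. 14-16): ERASE then ADD, applied to the key-memory only.
F_erase = sigmoid(W_e @ h)                  # what to remove from each slot
F_add = np.tanh(W_a @ h)                    # what to write into each slot
omega = alpha                               # slot weights (via g in the paper)
K = K * (1.0 - omega[:, None] * F_erase) + omega[:, None] * F_add

assert c.shape == (d,) and K.shape == (M, d)
assert np.allclose(A, V)            # value-memory is untouched
```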

Training Procedure
Our model is trained in an end-to-end manner using the training data (I, Y), where I is an image and Y is the corresponding human-annotated image description. We assume that Y consists of N sentences and each sentence y_i contains T words. Our overall training loss consists of two cross-entropy losses (the sentence loss and the word loss) and an "attention time loss" used as the time-cost penalty for the adaptive attention strategy, which is minimized as follows:

J(θ) = λ_s Σ_{i=1}^{N} L_CE(p_i, δ_i) − λ_w Σ_{i=1}^{N} Σ_{t=1}^{T} log p(w^{y_i}_t) + λ_a Σ_{i=1}^{N} Σ_{t=1}^{T} O(i, t)   (17)

where θ is the set of model parameters, p_i is the probability distribution over the states {CONTINUE=1, STOP=0} for the i-th sentence, L_CE denotes the cross-entropy loss, and δ_i is the target state for the i-th sentence, which is set to 0 when i = N and to 1 otherwise.
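A toy numerical sketch of how the three weighted terms combine, using the λ values reported in the implementation details; the model outputs here are fabricated placeholders.

```python
import numpy as np

# Loss weights lambda_w, lambda_s, lambda_a as set in the implementation details.
lam_w, lam_s, lam_a = 1.0, 5.0, 1e-4

def cross_entropy(p, target):
    # Binary cross-entropy for the CONTINUE(1)/STOP(0) decision.
    return -np.log(p if target == 1 else 1.0 - p)

# Hypothetical model outputs for a 2-sentence paragraph.
p_continue = [0.9, 0.2]                  # p_i; targets: CONTINUE, then STOP for i = N
word_probs = [[0.7, 0.6], [0.8, 0.5]]    # p(w_t | ...) for the ground-truth words
attn_steps = [[2, 3], [1, 2]]            # O(i, t): attention steps used per word

loss_s = sum(cross_entropy(p, 1 if i < len(p_continue) - 1 else 0)
             for i, p in enumerate(p_continue))
loss_w = sum(-np.log(p) for sent in word_probs for p in sent)
loss_a = sum(o for sent in attn_steps for o in sent)  # time-cost penalty

loss = lam_w * loss_w + lam_s * loss_s + lam_a * loss_a
assert loss > 0
```

The tiny λ_a keeps the attention-time penalty from dominating: it only nudges the model toward using fewer attention steps per word.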
To improve the performance of our model, we apply a policy gradient method (Williams, 1992) to optimize the model after the cross-entropy training by minimizing the negative expected reward:

J_RL(θ) = −E[ r(w^{y_{1:i}}_{1:t}) ]   (18)

where we choose the CIDEr metric as the reward function r.
Following (Rennie et al., 2017), the gradient of the expected reward can be approximated as:

∇_θ J_RL(θ) ≈ −( r(w^{y_{1:i}}_{1:t}) − r(ŵ^{y_{1:i}}_{1:t}) ) ∇_θ log p(w^{y_{1:i}}_{1:t})   (19)

where w^{y_{1:i}}_{1:t} is a paragraph caption sampled by the Monte-Carlo method, and ŵ^{y_{1:i}}_{1:t} is a greedily decoded paragraph caption used as the baseline to reduce the variance of the gradient estimate.
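The self-critical update can be sketched as follows. The CIDEr scorer, decoded paragraphs, and token log-probabilities are toy stand-ins; the sketch only shows how the greedy baseline turns the sampled reward into an advantage that scales the log-likelihood term in Eq. (19).

```python
import numpy as np

# Stand-in reward: fraction of distinct words, loosely mimicking how CIDEr
# penalizes repetition. The real method scores against reference paragraphs.
def reward(paragraph):
    words = paragraph.split()
    return len(set(words)) / 10.0

sampled = "there are many trees behind the pool"   # Monte-Carlo sample w
greedy = "there is a pool and a pool"              # greedy baseline w_hat

advantage = reward(sampled) - reward(greedy)       # r(w) - r(w_hat)

logp_sampled = np.array([-1.2, -0.8, -2.1])        # log p of sampled tokens (toy values)

# Policy-gradient loss whose gradient matches Eq. (19): minimizing it scales
# the sampled tokens' log-likelihood by the (negated) advantage.
loss = -advantage * logp_sampled.sum()

# A positive advantage (sample beats greedy) pushes the sampled tokens'
# probabilities up; a negative one pushes them down.
assert advantage > 0 and loss > 0
```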

Experimental Setup
Dataset We conduct the experiments and evaluate our IMAP model on the widely used Stanford image paragraph dataset (Krause et al., 2017), which is the only open-source benchmark dataset available for image paragraph captioning. The Stanford dataset is collected from MS COCO (Lin et al., 2014) and Visual Genome. It consists of 14,575 images for training, 2,487 images for validation, and 2,489 images for testing, each of which has one human-annotated paragraph.
Implementation Details For each image, Faster R-CNN (Ren et al., 2015) initialized with the VGG-16 network (Simonyan and Zisserman, 2014) is applied to detect objects in the image, and the top M = 50 detected regions are selected as the semantic feature vectors. The size of each feature vector is 4,096, and the word embedding size is set to 512. We set the maximum number of sentences in each paragraph to L = 6, the maximum length of each sentence to 30 via padding, and the maximum length of dense phrases to 8. The hidden sizes of both the Sent-LSTM and the two stacked Word-LSTMs are set to 512. The number of hidden units in the attention layer is 512. We set λ_w, λ_s, and λ_a to 1.0, 5.0, and 10^{-4}, respectively. The maximum number of attention steps is set to 4. The vocabulary used in the experiments is the same as (Krause et al., 2017). We first pre-train our model with the cross-entropy loss, using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 5 × 10^{-4}. After that, the self-critical training method with CIDEr as the reward is used to further optimize the model. During this stage, the initial learning rate of the Adam optimizer is set to 5 × 10^{-5}.

Baseline Methods
In the experiments, we compare the proposed IMAP with the following state-of-the-art methods: DenseCap-Concat (Johnson et al., 2016), Regions-Hierarchical (Krause et al., 2017), IAP, which adapts an attention-based captioning model (Anderson et al., 2018) to generate paragraph captions, and ICAP, which extends the IAP model by using the coverage vector (Tu et al., 2016) to summarize the attention records during the decoding process.

Automatic Evaluation Results
We quantitatively evaluate our model across six automatic evaluation metrics that are widely used in previous work (Krause et al., 2017; Liang et al., 2017; Chatterjee and Schwing, 2018), including BLEU-N (N = 1, 2, 3, 4) (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), and CIDEr (Vedantam et al., 2015). These metrics estimate the n-gram consistency between the produced image descriptions and the ground-truth captions. The experimental results on the Stanford dataset are summarized in Table 1. The IMAP model achieves significantly better performance than the state-of-the-art competitors on most of the automatic evaluation measures. Concretely, IMAP yields better scores on all evaluation metrics than the Regions-Hierarchical model, which uses a similar basic hierarchical LSTM backbone to ours. In addition, IMAP also achieves better scores than ICAP, which uses a coverage mechanism to alleviate the repetitive and incomplete captioning problems; this verifies the effectiveness of our interactive memory-augmented attention.

Ablation Study
To analyze the effect of each component of the IMAP model, we perform an ablation test of IMAP in terms of discarding the interactive key-value memory network (denoted as w/o memory) and removing the language information (denoted as w/o language). In addition, to investigate the effect of the self-critical training method, we report the results of the models trained with both the cross-entropy and CIDEr-optimization objectives. The ablation test results are reported in Table 2. We can observe that both the interactive key-value memory-augmented attention and the language features contribute greatly to our model. Benefiting from the memory states which keep track of the attention history, the decoder can capture important information and selectively attend to the image regions, especially the unexplored areas, which alleviates the repetitive and incomplete captioning problems. Meanwhile, the language features (dense phrases produced by DenseCap) provide complementary information for the model to learn better alignment between visual and language features, so the model can generate more precise and fine-grained image descriptions. In addition, we can also observe that the model with policy gradient substantially outperforms the model with cross-entropy on all the evaluation metrics. This is because the policy-gradient update is able to bypass the exposure-bias and non-differentiable evaluation-metric issues, and to maximize long-term reward in paragraph caption generation.

[Figure 3: Example image captions generated by Region-Hierarchical and IMAP. The words with colors indicate accurate semantic matches between the generated paragraphs and ground truth paragraphs.]
[Figure 4: Examples of generated paragraphs with attended image regions. We visualize the mean attention weights on individual pixels for each sentence. The white regions indicate the regions the model roughly attends to when generating the sentences.]

Human Evaluation Results
We also conduct a human evaluation to verify the proposed model. In particular, we randomly selected 200 images from the test set and invited 4 well-educated volunteers to judge the quality of the captions generated by different models. For a generated paragraph caption, a score of +2 indicates the caption is fluent and informative; +1 indicates that the description is fluent but too generic; 0 indicates that the caption is not fluent or contains objects that do not exist in the image. The proportion of each score (0, +1, +2) and the average score are reported for each model. Table 3 presents the results of the human evaluation. Consistent with the results of the automatic evaluation metrics, the proposed IMAP model generates more relevant, informative, and natural captions than the other models.

Qualitative Results
To evaluate the proposed IMAP model qualitatively, we show some image paragraph captions generated by IMAP and the Region-Hierarchical model in Figure 3. IMAP can generate coherent, non-repetitive, and comprehensive paragraphs by leveraging the interactive key-value memory-augmented attention to keep track of the image coverage information. In contrast, the Region-Hierarchical (RH) model is prone to generate duplicate phrases within a paragraph. Taking the first case in Figure 3 as an example, the RH model generates the repetitive phrases "traffic lights on the right side" and "traffic light is in front" within a paragraph. In addition, it also misses the salient objects "people" and "buildings" in the generated paragraph, while IMAP generates the corresponding descriptions "people walking" and "many buildings".
The interactive key-value memory-augmented attention is supposed to keep track of the attention history and the image coverage information. To verify this, in Figure 4, we visualize the attended image regions when generating different sentences in the paragraph. We can observe that IMAP is able to focus on the correct image regions when generating the corresponding sentences. For example, our model can attend to the image object "tennis racket" when generating the sentence "the woman is holding a black tennis racket in her hand". The advantage of IMAP comes from keeping track of attention history and image coverage information.

Conclusion
In this paper, we proposed an effective interactive key-value memory-augmented attention to alleviate repetitive and incomplete captioning problems in image paragraph captioning, which maintains a timely updated key-memory to track attention history and a fixed value-memory to store the image features during the whole decoding process. To verify the effectiveness of the proposed model, we conducted extensive experiments on the widely used Stanford dataset. The experimental results demonstrated that our model achieved impressive results compared to state-of-the-art image paragraph generation techniques.