Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach

Constructing an organized dataset comprising a large number of images with several captions per image is a laborious task that requires vast human effort. On the other hand, collecting large numbers of images and sentences separately may be immensely easier. In this paper, we develop a novel data-efficient semi-supervised framework for training an image captioning model. We leverage massive unpaired image and caption data by learning to associate them. To this end, our proposed semi-supervised learning method assigns pseudo-labels to unpaired samples via Generative Adversarial Networks to learn the joint distribution of images and captions. For evaluation, we construct a scarcely-paired COCO dataset, a modified version of the MS COCO caption dataset. The empirical results show the effectiveness of our method compared to several strong baselines, especially when paired samples are scarce.


Introduction
Image captioning is the task of automatically generating a natural language description of a given image. It is a highly useful task, in that 1) it extracts the essence of an image into a self-descriptive form of representation, and 2) the output format is natural language, which exhibits free-form and manageable characteristics useful to applications such as language-based image or region retrieval [Johnson et al., 2016; Karpathy and Fei-Fei, 2015; Kim et al., 2019], video summarization [Choi et al., 2018], navigation [Wang et al., 2018a], and vehicle control [Kim et al., 2018b]. Image captioning exhibits free-form characteristics, since it is not confined to a small number of predefined classes. This enables descriptive analysis of a given image.
Recent research on image captioning has made impressive progress [Anderson et al., 2018; Vinyals et al., 2015].

Figure 1: The proposed data setup utilizes both "paired" and "unpaired" image-caption data. We denote the paired data as D_p, and the unpaired image and caption datasets as D_u^x and D_u^y, respectively.

Despite this progress, the majority of works are trained via supervised learning, which requires a large corpus of caption-labeled images such as the MS COCO caption dataset [Lin et al., 2014]. Specifically, the dataset was constructed by asking annotators to provide five grammatically plausible sentences for each image. Constructing such a human-labeled dataset is immensely laborious and time-consuming. This is a core challenge of image captioning, because the task heavily requires large data, i.e., it is a data-hungry task.
In this work, we present a novel way of leveraging unpaired image and caption data to train data-hungry image captioning neural networks. We consider a scenario in which we have small-scale paired image-caption data from a specific domain. We are motivated by the fact that images can be easily obtained from the web, and captions can be easily augmented and synthesized by replacing or adding different words in given sentences, as done in [Zhang et al., 2015]. Moreover, given a sufficient amount of descriptive captions, it is easy to crawl corresponding but noisy images through the Google or Flickr image databases [Thomee et al., 2016] to build an image corpus. In this way, we can easily construct a scalable unpaired dataset of images and captions, which requires no (or at most minimal) human effort.
Due to the unpaired nature of images (input) and captions (output supervision), the conventional supervision loss can no longer be used directly. We propose to algorithmically assign supervision labels, termed pseudo-labels, to make unpaired data paired. A pseudo-label is used as a learned supervision label. To develop the pseudo-labeling mechanism, we use a generative adversarial network (GAN) [Goodfellow et al., 2014] to search pseudo-labels for unpaired data. That is, in order to find appropriate pseudo-labels for unpaired samples, we utilize adversarial training to learn a discriminator model. Thereby, the discriminator learns to distinguish between real and fake image-caption pairs, to retrieve pseudo-labels, and to enrich the captioner training.
Our main contributions are summarized as follows. (1) We propose a novel framework for training an image captioner with unpaired image-caption data and a small amount of paired data.
(2) In order to facilitate training with unpaired data, we devise a new semi-supervised learning approach based on a novel usage of the GAN discriminator. (3) We construct our scarcely-paired COCO dataset, a modified version of the MS COCO dataset without pairing information. On this dataset, we show the effectiveness of our method in various challenging setups, compared to strong competing methods.

Related Work
The goal of our work is to deal with unpaired image-caption data for image captioning. Therefore, we mainly focus on the image captioning and unpaired data handling literature.

Data Issues in Image Captioning. Since the introduction of large-scale datasets such as MS COCO [Lin et al., 2014], image captioning has been extensively studied in the vision and language community [Anderson et al., 2018; Rennie et al., 2017; Vinyals et al., 2015; Xu et al., 2015] by virtue of the advancement of deep convolutional neural networks [Krizhevsky et al., 2012]. As neural network architectures become more advanced, they require much larger datasets for generalization [Shalev-Shwartz and Ben-David, 2014]. Despite the extensive study of better network architectures, data issues in image captioning, such as noisy data, partially missing data, and unpaired data, have barely been studied.
Unpaired image-caption data has only recently been discussed. Gu et al. [2018] introduce a third modality, Chinese captions, for language pivoting [Utiyama and Isahara, 2007; Wu and Wang, 2007]. Feng et al. [2019] propose an unpaired captioning framework that trains a model without image or sentence labels by learning a visual concept detector with external data, the OpenImages dataset [Krasin et al., 2017]. Chen et al. [2017] approach image captioning as domain adaptation, utilizing the large paired MS COCO data as the source domain and adapting to a separate unpaired image or caption dataset as the target domain. Liu et al. [2018] use self-retrieval to facilitate training a captioning model with partially labeled data, where the self-retrieval module tries to retrieve the corresponding images based on generated captions. As a separate line of work, there are novel object captioning methods [Anne Hendricks et al., 2016; Venugopalan et al., 2017] that additionally exploit unpaired image and caption data corresponding to a novel word.
Most of the aforementioned works [Gu et al., 2018; Anne Hendricks et al., 2016; Venugopalan et al., 2017; Feng et al., 2019] exploit large auxiliary supervised data that goes beyond image-caption data. To the best of our knowledge, we are the first to study how to handle unpaired image and caption data for image captioning without any auxiliary information, leveraging only scarce paired image-caption data. Although [Chen et al., 2017] does not use auxiliary information either, it requires a large amount of paired source data, a data regime different from ours. The same holds for [Liu et al., 2018], which uses the full paired MS COCO caption dataset with an additional large unlabeled image set. Our method operates in a very scarce paired-data regime, whose scale is only 1% of the COCO dataset.
Multi-modality in Unpaired Data Handling. By virtue of the advancement of generative modeling techniques, e.g., GAN [Goodfellow et al., 2014], multi-modal translation has recently emerged as a popular field. Among many possible modalities, image-to-image translation between two different (and unpaired) domains has been explored the most. To tackle this problem, the cycle-consistency constraint between unpaired data is exploited in CycleGAN [Zhu et al., 2017] and DiscoGAN [Kim et al., 2017], and further improved in UNIT [Liu et al., 2017].
In this work, we regard image captioning as multi-modal translation. Our work has similar motivation to unpaired image-to-image translation [Zhu et al., 2017], unsupervised machine translation [Artetxe et al., 2018; Lample et al., 2018a,b], and machine translation with monolingual data [Zhang et al., 2018]. However, we show that cycle-consistency does not work in our problem setup due to the significant modality gap. Instead, our results suggest that the traditional label-propagation-based semi-supervised framework [Zhou et al., 2004] is more effective for our task.

Semi-supervised Learning. Our method is motivated by generative-model-based semi-supervised learning [Chongxuan et al., 2017; Gan et al., 2017; Kingma et al., 2014], which mostly deals with classification labels. In contrast, we leverage caption data and devise a novel model to train the captioning model.

Proposed Framework
In this section, we first review standard image caption learning and describe how we can leverage the unpaired dataset. Then, we introduce an adversarial learning method for obtaining a GAN model that is used to assign pseudo-labels and to encourage matching the distributions of latent features of images and captions. Lastly, we describe a connection to CycleGAN.

Model
Let us denote a dataset with N_p image-caption pairs as D_p = {(x_i, y_i)}_{i=1}^{N_p}. A typical image captioning task is defined as follows: given an image x_i, we want a model to generate a caption y_i that best describes the image. Traditionally, a captioning model is trained on a large paired dataset (x, y) ∈ D_p, e.g., the MS COCO dataset, by minimizing the negative log likelihood of the ground-truth caption:

    min_θ Σ_{(x,y)∈D_p} L_CE(x, y; θ),    (1)

where L_CE = −log q_θ(y|x) denotes the cross-entropy loss, θ the set of learnable parameters, and q(·) the probability output of the model. Motivated by the encoder-decoder-based neural machine translation literature [Cho et al., 2014], traditional captioning frameworks are typically implemented as an encoder-decoder architecture [Vinyals et al., 2015], i.e., CNN-RNN. The CNN encoder F(x; θ_enc) produces a latent feature vector z_x from a given input image x, and the RNN decoder H(z_x; θ_dec) follows to generate a caption y in natural language form from z_x, as depicted in Fig. 2. The loss in Eq. (1) is typically implemented as the sum of the cross entropy over word tokens. For simplicity, we omit θ from here on when deducible, e.g., H(z_x).
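As a concrete illustration, the per-caption loss L_CE = −log q_θ(y|x) is simply the sum of per-word negative log-probabilities. The following minimal NumPy sketch (our own illustrative code, not the authors' implementation) computes it from the probabilities a decoder assigns to each ground-truth token:

```python
import numpy as np

def caption_nll(token_probs):
    """Negative log-likelihood of one caption, L_CE = -log q_theta(y|x),
    computed as the sum over time steps of -log q_theta(y_t | x, y_<t)."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(-np.sum(np.log(token_probs)))

# Toy example: probabilities the decoder assigned to three ground-truth words.
loss = caption_nll([0.9, 0.8, 0.7])
```

In a real CNN-RNN captioner these probabilities come from the softmax output of the decoder at each time step; the loss is then averaged over the minibatch.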
Learning with Unpaired Data. Our problem deals with unpaired data samples, where the image and caption sets D_u^x = {x_i}_{i=1}^{N_x} and D_u^y = {y_i}_{i=1}^{N_y} are not paired. Given the unpaired datasets, due to the missing supervision, the loss in Eq. (1) cannot be directly computed. Motivated by the nearest-neighbor-aware semi-supervised framework [Shi et al., 2018], we artificially generate pseudo-labels for the respective unpaired datasets. Specifically, we retrieve the best-matching caption ỹ_i in D_u^y given a query image x_i and assign it as a pseudo-label, and vice versa (x̃_i for y_i). We abuse the pseudo-label notation as a function for simplicity, e.g., ỹ_i = ỹ(x_i). To retrieve a semantically meaningful match, we need a measure to assess matches. We use a discriminator network, trained by GAN, to determine real or fake pairs, which we describe later. With the retrieved pseudo-labels, we can compute Eq. (1) by modifying it as:

    L_cap = Σ_{(x,y)∈D_p} L_CE(x, y) + λ_x Σ_{x∈D_u^x} L_CE(x, ỹ(x)) + λ_y Σ_{y∈D_u^y} L_CE(x̃(y), y),    (2)

where λ_{·} denote the balance parameters.
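The retrieval step described above can be sketched as follows. This is an illustrative NumPy toy (the score matrix and function names are ours, not from the paper), where the discriminator score plays the role of the matching measure:

```python
import numpy as np

# Toy discriminator scores: scores[i, j] approximates D(F(x_i), G(y_j)),
# the discriminator's belief that image i and caption j form a real pair.
scores = np.array([[0.1, 0.9, 0.3],
                   [0.7, 0.2, 0.4]])

def assign_pseudo_captions(scores):
    """For each unpaired image, pick the unpaired caption with the highest
    discriminator score as its pseudo-label. The symmetric direction
    (pseudo-images for captions) would use argmax over axis 0."""
    return np.argmax(scores, axis=1)

pseudo_idx = assign_pseudo_captions(scores)  # caption 1 for image 0, caption 0 for image 1
```

In practice the scores come from a learned discriminator over encoded features, and the retrieved captions are plugged into the λ-weighted terms of Eq. (2).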
Discriminator Learning by Unpaired Feature Matching. We train a criterion to find semantically meaningful matches, so that pseudo-labels for each modality are effectively retrieved. We learn a discriminator for this purpose using the small amount of paired supervised data.
We adopt a caption encoder, G(y; θ_cap), which embeds the caption y into a feature z_y. This is implemented with a single-layer LSTM, and we take the output of the last time step as the caption representation z_y. Likewise, given an image x, we obtain z_x from the image encoder F(x; θ_enc). Now we have a comparable feature space for z_x and z_y. We utilize the discriminator to distinguish whether the pair (z_x, z_y) comes from true paired data (x, y) ∈ D_p, i.e., whether the pair belongs to the real distribution p(x, y) or not. We could use random pairs of (x, y) independently sampled from the respective unpaired datasets, but we found this detrimental due to uninformative pairs. Instead, we conditionally synthesize z_x or z_y to form a synthesized pair that appears as realistic as possible. We use the feature transformer networks z̃_y = T_{v→c}(z_x) and z̃_x = T_{c→v}(z_y), where v→c denotes the mapping from visual data to caption data and vice versa, and z̃ denotes the conditionally synthesized feature.
{T} are implemented as multi-layer perceptrons with four FC layers and ReLU nonlinearities. The discriminator D(·, ·) learns to distinguish whether feature pairs are real or not. At the same time, the associated networks F, G, and T_{·} are trained to fool the discriminator by matching the distributions of paired and unpaired data. Motivated by [Chongxuan et al., 2017], the adversarial training is formulated as:

    min_{θ,{T}} max_D  E_{(x,y)~p(x,y)}[log D(z_x, z_y)] + E_{x~p(x)}[log(1 − D(z_x, T_{v→c}(z_x)))] + E_{y~p(y)}[log(1 − D(T_{c→v}(z_y), z_y))],    (3)

where we use the distribution notation p(·) to flexibly refer to any type of the dataset, regardless of D_p and D_u. Note that the first log term is not used for updating any learnable parameters related to θ, {T}, but only for updating D.
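The discriminator side of this objective reduces to a standard binary cross-entropy over one real pair and two synthesized pairs. The NumPy toy below (our sketch, operating on scalar discriminator logits rather than real feature pairs) shows that form:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def discriminator_loss(real_logit, fake_logits):
    """Discriminator side of the adversarial objective: push the real
    paired feature (z_x, z_y) toward label 1 and the conditionally
    synthesized pairs (z_x, T_vc(z_x)) and (T_cv(z_y), z_y) toward 0."""
    loss = -np.log(sigmoid(real_logit))
    for f in fake_logits:
        loss += -np.log(1.0 - sigmoid(f))
    return float(loss)

# A confident discriminator (real scored high, fakes scored low) has near-zero loss.
confident = discriminator_loss(10.0, [-10.0, -10.0])
```

The generator networks (F, G, {T}) are updated with the opposite objective on the two synthesized-pair terms, alternating with discriminator updates as usual in GAN training.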
Through alternating training of the discriminator (D) and the generator (θ, {T}), similar to [Kim et al., 2018a], the latent feature distributions of paired and unpaired data should become close to each other, i.e., p(z_x, z_y) = p(z_x, T_{v→c}(z_x)) = p(T_{c→v}(z_y), z_y) (refer to the proof in [Chongxuan et al., 2017]). In addition, as the generator is trained, the decision boundary of the discriminator tightens. If the unpaired datasets are sufficiently large that semantically meaningful matches exist between the different modality datasets, and if the discriminator D is plausibly learned, we can use D to retrieve proper pseudo-labels. Our architecture is illustrated in Fig. 2.
Pseudo-label Assignment. Given an image x ∈ D_u^x, we retrieve the caption in the unpaired dataset, i.e., ỹ ∈ D_u^y, that has the highest score from the discriminator, i.e., the caption most likely to be paired with the given image:

    ỹ(x) = argmax_{y∈D_u^y} D(F(x), G(y)),    (4)

and vice versa for unpaired captions:

    x̃(y) = argmax_{x∈D_u^x} D(F(x), G(y)).    (5)

By this retrieval process over all the unpaired data, we have image-caption pairs {(x_i, y_i)} from the paired data and the pairs with pseudo-labels {(x_j, ỹ_j)} and {(x̃_k, y_k)} from the unpaired data. However, these pseudo-labels are not noise-free, so treating them equally with the paired ones is detrimental. Motivated by learning with noisy labels [Lee et al., 2018; Wang et al., 2018b], we define a confidence score for each assigned pseudo-label. We use the output score of the discriminator as the confidence score, i.e., α_i^x = D(x_i, ỹ_i) and α_i^y = D(x̃_i, y_i), where we denote D(x, y) = D(F(x), G(y)), and α ∈ [0, 1] due to the sigmoid at the final layer of the discriminator. We then utilize the confidence scores to weight the unpaired samples, computing the weighted loss as:

    L_cap = Σ_{(x,y)∈D_p} L_CE(x, y) + λ_x Σ_i α_i^x L_CE(x_i, ỹ_i) + λ_y Σ_i α_i^y L_CE(x̃_i, y_i).    (6)

We jointly train the model on both paired and unpaired data. To further ease training, we add an additional triplet loss L_triplet (Eq. (7)), which regards random unpaired samples as negatives. This slightly improves the performance.
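The noise-handling step above amounts to a confidence-weighted sum of cross-entropy terms. A minimal sketch (illustrative names, assuming the λ = 0.1 used in our experiments) is:

```python
import numpy as np

def weighted_pseudo_loss(ce_losses, confidences, lam=0.1):
    """Noise handling by sample re-weighting: each pseudo-labeled sample's
    cross-entropy is scaled by its discriminator confidence alpha in [0, 1],
    then balanced against the paired-data loss by lambda."""
    ce = np.asarray(ce_losses, dtype=float)
    alpha = np.asarray(confidences, dtype=float)
    return lam * float(np.sum(alpha * ce))

# Low-confidence pseudo-pairs (small alpha) contribute little to the loss.
contribution = weighted_pseudo_loss([1.0, 2.0], [1.0, 0.5])
```

The same weighting is applied symmetrically to both pseudo-captioned images and pseudo-imaged captions before adding them to the supervised term of Eq. (6).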

Connection with CycleGAN
As a representative fully unpaired method, CycleGAN [Zhu et al., 2017] is a strong baseline at this point, having been popularly used for unpaired distribution matching. Since it is designed for image-to-image translation, we describe the modifications needed to fit it to our task, so as to understand the relative performance of our method over the CycleGAN baseline. When applying cycle-consistency to matching between images and captions, since the input modalities are totally different, we modify it to a translation problem over feature spaces:

    L_cyc = E_x[‖T_{c→v}(T_{v→c}(z_x)) − z_x‖_1] + E_y[‖T_{v→c}(T_{c→v}(z_y)) − z_y‖_1],    (8)

where {T} and D_{x,y} denote the feature translators and the discriminators for the image and caption domains. Each discriminator D is learned to distinguish whether a latent feature comes from the image or the caption distribution. This differs from our method, in that we distinguish the correct semantic match of pairs. We experimented with CycleGAN purely on unpaired datasets but were not able to train it; hence, we add the supervised loss (Eq. (1)) with the paired data, which is a fair setting with ours.
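For intuition, the feature-space cycle constraint used in this baseline can be sketched as below. The translators here are placeholder functions, and the L1 penalty is an assumption matching common CycleGAN practice:

```python
import numpy as np

def cycle_consistency_loss(z_x, z_y, t_vc, t_cv):
    """L1 cycle loss over latent features: translating an image feature to
    the caption space and back (and vice versa) should act as the identity.
    t_vc and t_cv stand for the translators T_{v->c} and T_{c->v}."""
    return float(np.abs(t_cv(t_vc(z_x)) - z_x).mean()
                 + np.abs(t_vc(t_cv(z_y)) - z_y).mean())

# With identity translators the cycle loss vanishes.
z_x = np.array([1.0, -2.0])
z_y = np.array([0.5, 0.5])
identity = lambda z: z
loss = cycle_consistency_loss(z_x, z_y, identity, identity)
```

In the actual baseline the translators are learned networks and the cycle term is combined with per-domain adversarial losses, unlike our pair-matching discriminator.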

Experiments
In this section, we describe the experimental setups, competing methods and provide performance evaluations of unpaired captioning with both quantitative and qualitative results.

Implementation Details
We implement our neural networks with the PyTorch library [Paszke et al., 2017]. We use ResNet-101 [He et al., 2016] as the image encoder. We set the channel size of the hidden layers of the LSTMs to 1024, 512 for the attention layer, and 1024 for the word embeddings. At the inference stage, we empirically set the beam size to 3 when generating a description, which gives the best performance.
We use a minibatch size of 100 and the Adam optimizer [Ba and Kingma, 2015] for training (learning rate 5e-4, β_1 = 0.9, β_2 = 0.999). For the hyperparameters, we empirically set λ_x = λ_y = 0.1. The total loss function for training our model is:

    L_total = L_cap + L_GAN + L_triplet,

where L_cap denotes the captioning loss defined in Eq. (6), L_GAN the adversarial loss defined in Eq. (3), and L_triplet the triplet loss defined in Eq. (7).

Table 1: We indicate the ablation study by: (A) the usage of the proposed GAN that distinguishes real or fake image-caption pairs, (B) pseudo-labeling, and (C) noise handling by sample re-weighting. We also compare with [Gu et al., 2018] and [Feng et al., 2019], which are trained with unpaired datasets.
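The triplet term and the total objective can be sketched as follows; the Euclidean triplet margin form and the margin value are standard choices assumed here, not specified by the text:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss over latent features: pull the matched caption
    feature toward the image feature and push a random unpaired (negative)
    one away. The exact distance and margin are assumptions."""
    d_pos = np.linalg.norm(np.asarray(anchor) - np.asarray(positive))
    d_neg = np.linalg.norm(np.asarray(anchor) - np.asarray(negative))
    return float(max(0.0, d_pos - d_neg + margin))

def total_loss(l_cap, l_gan, l_triplet):
    """Overall objective: captioning loss (Eq. 6) + adversarial loss (Eq. 3)
    + triplet loss (Eq. 7), combined here as a plain sum."""
    return l_cap + l_gan + l_triplet
```

All three terms are minimized jointly with Adam, alternating with the discriminator update described earlier.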

Experimental Setups
We utilize MS COCO caption [Lin et al., 2014] as our target dataset, which contains 123k images with 5 caption labels per image. To validate our model, we follow the Karpathy splits [Karpathy and Fei-Fei, 2015], which have been broadly used in the image captioning literature. The Karpathy splits contain 113k training, 5k validation, and 5k test images. In our experiments, in order to simulate the scenario where both paired and unpaired data exist, we use two different data setups: 1) the proposed scarcely-paired COCO setup, and 2) the partially labeled COCO [Liu et al., 2018] setup.
For the scarcely-paired COCO setup, we remove the pairing information of the MS COCO dataset while leaving a small fraction of pairs unaltered. We randomly select only one percent of the total data, i.e., 1,133 training images, as paired data, and remove the pairing information of the rest to obtain unpaired data. We call the small fraction of samples given as pairs the paired data (D_p) and the samples without pairing information the unpaired data (D_u). We study the effects of other ratios of paired data used for training (in Figs. 3 and 4). This dataset allows us to evaluate whether the proposed framework can leverage such small paired data to learn a plausible pseudo-label assignment, and what performance can be achieved compared to the fully supervised case.
For the partially labeled COCO setup, we follow [Liu et al., 2018] and use the whole (paired) MS COCO data, adding the "Unlabeled-COCO" split of the official MS COCO release [Lin et al., 2014] as unpaired images; this split contains 123k images without any caption labels. To compute the cross-entropy loss, we use pseudo-label assignment for the Unlabeled-COCO images.
In order to avoid the high time complexity of the pseudo-labeling process, we do not search pseudo-labels over the whole unpaired dataset. Pseudo-label retrieval is done on a random one-hundredth subset of the unpaired data (yielding 1,000 candidates). Thus, the complexity of each minibatch becomes O(B × N × 0.01), where B denotes the minibatch size and N the size of the unpaired dataset. Since we also apply label-noise handling, a plausible pseudo-label assignment is sufficient to help the learning process. Performance could be further improved by using a larger subset or a fast approximate nearest-neighbor search.
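The subsampling trick can be sketched as below (an illustrative helper of ours, with a fixed seed for reproducibility):

```python
import random

def sample_candidate_pool(unpaired, fraction=0.01, seed=0):
    """Restrict the pseudo-label search to a random 1/100 subset of the
    unpaired set, reducing per-minibatch retrieval cost from O(B x N)
    to O(B x N x 0.01)."""
    rng = random.Random(seed)
    items = list(unpaired)
    k = max(1, int(len(items) * fraction))
    return rng.sample(items, k)

pool = sample_candidate_pool(range(100000))  # 1000 candidates per search
```

Each minibatch then runs the discriminator-based argmax retrieval only over this pool rather than the full unpaired dataset.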

Evaluation on Scarcely-paired COCO
We follow the same setting as [Vinyals et al., 2015] unless otherwise mentioned. Table 1 shows the results on the scarcely-paired COCO dataset. We compare with several baselines: Paired Only, where we train our model only on the small fraction (1%) of paired data, and CycleGAN, where we train our model with the cycle-consistency loss [Zhu et al., 2017] (as in Eq. (8)). Additionally, we train variants of our model, denoted Ours (ver1, ver2, and final). Ours (ver1) is trained via our GAN model (Eq. (3)) that distinguishes real or fake image-caption pairs. Ours (ver2) adds training with pseudo-labels for unpaired data via Eq. (6), while setting the confidence scores α^x = α^y = 1 for all training samples. Our final model, Ours (final), applies the noise-handling technique: the confidence scores α^x and α^y are computed as in Eq. (6) and used to re-weight the loss for each sample. We also present the performance of the fully supervised (Fully paired) model using 100% of the COCO training data for reference.
As shown in Table 1, in the scarce data regime, utilizing unpaired data improves captioning performance in all metrics by noticeable margins. Our models also show favorable performance compared to the CycleGAN model in all metrics. Our final model with pseudo-labels and noise handling achieves the best performance in all metrics across all competitors; hereafter we refer to this model as our model.
We also compare with the recent unpaired image captioning methods [Gu et al., 2018; Feng et al., 2019] in Table 1. Both methods are evaluated on the MS COCO test set. In the case of Gu et al., the AIC-ICC image-to-Chinese dataset [Wu et al., 2017] is used as the unpaired image data D_u^x.

We study our final model against our Paired Only baseline under varying amounts of paired training data in Fig. 3, so that we can see how much information is gained from the unpaired data. From 100% down to 10%, as the amount of paired samples decreases, the fluency and accuracy of the descriptions worsen. In particular, we observed that most captions generated by the Paired Only baseline trained with 10% of paired data (11,329 pairs) show erroneous grammatical structure. In contrast, by leveraging unpaired data, our method generates more fluent and accurate captions than Paired Only trained on the same amount of paired data. It is worth noting that our model trained with 60% of paired data (67,972 pairs) achieves performance similar to the Paired Only baseline trained with fully paired data (113,287 pairs), which signifies that our method can save nearly half of the human labeling effort in constructing a dataset.

Table 2: Comparison with the semi-supervised image captioning method, "Self-Retrieval" [Liu et al., 2018]. Our method shows improved performance even without Unlabeled-COCO data (denoted as w/o unlabeled) as well as with Unlabeled-COCO (with unlabeled), although our model was not originally proposed for such a scenario.

We also show qualitative samples of our results in Fig. 4. The Paired Only baseline trained with 1% paired data produces erroneous captions, and the baseline with 10% paired data starts to produce plausible captions; note that the latter uses ten times more paired samples than our model, which uses only 1% of them. We highlight that, in the two examples on the top row of Fig. 4, our model generates more accurate captions than the Paired Only baseline trained on 100% paired data ("baseball" corrected to "Frisbee" on the top-left, and "man" to "woman" on the top-right). This suggests that unpaired data with our method effectively boosts performance, especially when paired data is scarce.
In order to demonstrate the meaningfulness of our pseudo-label assignment, we show the pseudo-labels assigned to unpaired samples. Fig. 5 shows the pseudo-labels (captions) assigned to unlabeled images from the Unlabeled-COCO set. As shown in the figure, despite not knowing the real pairs of these images, plausible pseudo-labels are assigned by the model. Note that even though there is no ground-truth caption for the unlabeled images in the search pool, the model is able to find the most likely (semantically correlated) caption for each given image.
Fig. 6 highlights interesting samples of our caption generation, where the results contain words that do not exist in the paired portion of the scarcely-paired COCO dataset. The semantic meanings of words such as "growling," "growing," and "herded," which appear in the unpaired caption data but not in the paired data, may have been learned via pseudo-label assignment during training. These examples suggest that our method can infer the semantic meaning of unpaired words to some extent, which could not have been learned from the small paired data alone. This is evidence that our framework can align abstract semantic spaces between the two modalities, i.e., vision and language.

Evaluation on Partially Labeled COCO
For a more realistic setup, we compare with the recent semi-supervised image captioning method, called Self-Retrieval [Liu et al., 2018], in their problem regime, i.e., the "partially labeled" COCO setup, where the full paired MS COCO set (113k) plus the 123k uncaptioned images of the Unlabeled-COCO set are used (no additional unpaired captions are used). While their regime is not our direct scope, we show the effectiveness of our method in it. Then, we extend our framework by replacing our backbone architecture with recent advanced image captioning architectures.
In this setup, a separate unpaired caption dataset D_u^y does not exist, so we use captions from the paired COCO data D_p as pseudo-labels.
Table 2 shows the comparison with Self-Retrieval. For a fair comparison, we replace the cross-entropy loss in our objective with the policy gradient method [Rennie et al., 2017] to directly optimize our model with the CIDEr score, as in [Liu et al., 2018]. As our baseline model (denoted as Baseline), we train a model with only the policy gradient method, without the proposed GAN model. When using only the 100% paired MS COCO dataset (denoted as w/o unlabeled), our model already improves over Self-Retrieval. Moreover, when adding Unlabeled-COCO images (denoted as with unlabeled), our model performs favorably against Self-Retrieval in all metrics. The results suggest that our method is also advantageous in the semi-supervised setup.

Table 3: Evaluation of our method with different backbone architectures. All models are reproduced and trained with the cross-entropy loss. By adding Unlabeled-COCO images, our training method is applied in a semi-supervised way, showing consistent improvement in all metrics.
To further validate our method in the semi-supervised setup, we compare different backbone architectures [Anderson et al., 2018; Rennie et al., 2017; Vinyals et al., 2015] in our framework, where these methods were developed for the fully-supervised setting. We use the same data setup as above, but replace the CNN (F) and LSTM (H) in our framework with the image encoder and caption decoder of each captioning model. These models can then be trained in our framework as-is, by alternating between the discriminator update and pseudo-labeling. Table 3 shows the comparison. Training with the additional Unlabeled-COCO data via our training scheme consistently improves all baselines in all metrics.

Captioning with Web-Crawled Data
To simulate a scenario involving data crawled from the web, we use the setup suggested by Feng et al. [2019]. They collect a sentence corpus by crawling image descriptions from Shutterstock as unpaired caption data D_u^y, whereby 2.2M sentences are collected. For the unpaired image data D_u^x, they use only the images from the MS COCO data, while the captions are not used for training. For training our method, we leverage 0.5% to 1% of paired MS COCO data as our paired data D_p. The results are shown in Table 4.

Conclusion
We introduce a method to train an image captioning model with large-scale unpaired image and caption data, given a small amount of paired data. Our framework achieves favorable performance compared to various methods and setups. Unpaired captions and images are data that can be easily collected from the web, which can facilitate application-specific captioning models in domains where labeled data is scarce.

Figure 2: Description of the proposed method. Dotted arrows denote the path of the gradients via back-propagation. Given any image-caption pair, the CNN and RNN (LSTM) encoders encode the input image and caption into their respective feature spaces. A discriminator (D) is trained to discriminate whether the given feature pairs are real or fake, while the encoders are trained to fool the discriminator. The learned discriminator is also used to assign the most likely pseudo-labels to unpaired samples through the pseudo-label search module.

Figure 3: Performance w.r.t. the amount of paired data used for training. Baseline denotes our Paired Only baseline, Ours is our final model, and Reference is Paired Only trained with the full paired data.
Ours (1%): a black bear walking through the grass near the forest.

Baseline (1%): a bunch of children white skis on.
Baseline (10%): a group of people standing on top of a snow covered slope.
Baseline (100%): a group of people standing on top of a snow covered slope.
Ours (1%): a group of people standing on a snow covered slope.

Baseline (1%): a baseball player playing with out a red.
Baseline (10%): a baseball player is getting ready to hit a ball.
Baseline (100%): a group of people playing a game of Frisbee.
Ours (1%): a baseball player playing is swinging a bat at a ball.

Baseline (1%): a train rolling down tracks at the track.
Baseline (10%): a train on track near a station.
Baseline (100%): a train is traveling down the tracks in a city.
Ours (1%): a train traveling down the tracks in front of a train station.

Baseline (1%): a little girl skier markers on to snow the slope.
Baseline (10%): a person skiing down a snowy slope.
Baseline (100%): a man riding skis down a snow covered slope.
Ours (1%): a person standing on skis in a snow covered field.

Baseline (1%): a zebra standing over ledge in a French together.
Baseline (10%): a zebra standing on top of a lush green field.
Baseline (100%): a zebra standing in a field with a tree in the background.
Ours (1%): a zebra standing on a grassy filed.

Baseline (1%): a bunch of sheep in front of the zoo standing.
Baseline (10%): a group of sheep standing next to each other.
Baseline (100%): a herd of sheep standing on top of a lush green field.
Ours (1%): a group of sheep standing next to a fence.

Figure 4: Sampled qualitative results of our model. We compare with baseline models trained only on N% of the paired samples of the full MS COCO. Despite using only 1% paired data, our model generates plausible captions similar to those of baseline models trained with more data (10% and above).
For both [Gu et al., 2018] and [Feng et al., 2019], captions from MS COCO are used as the unpaired caption data D_u^y. Note that this is not a fair comparison to our method, as they have more advantages: Gu et al. use a large amount of additional labeled data (10M Chinese-English parallel sentences from the AIC-MT dataset [Wu et al., 2017]) and Feng et al. use 36M samples from the additional OpenImages dataset, whereas our model uses only a small number of paired samples (1k) and 122k unpaired samples. Despite far lower reliance on paired data, our model shows favorable performance against these recent unpaired captioning works.

a bus is parked on a street. / a large bear standing on a grass. / a motorcycle parked on a street.

Figure 5: Examples of pseudo-labels (captions) assigned to unpaired images. Our model is able to assign plausible image-caption pairs through the proposed adversarial training.

Figure 6: Example captions containing words that do not exist in the paired dataset D_p. Novel words that are not in D_p but in D_u^y are highlighted in bold.

Table 4: Performance comparison with web-crawled data (Shutterstock). On top of unpaired image and caption data, our method is trained with 0.5-1% of paired data, while Feng et al. use 36M additional images from the OpenImages dataset.

Note again that Feng et al. exploit external large-scale data, i.e., 36M images in the OpenImages dataset. With 0.5% paired-only data (566 pairs), our baseline shows lower BLEU-4 and METEOR scores than Feng et al., while our proposed model shows comparable or favorable performance in BLEU-4, ROUGE-L, and METEOR. Our method achieves higher scores in all metrics from 0.8% of paired data (906 pairs), even without the additional 36M images.