Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning

Visual language grounding is widely studied in modern neural image captioning systems, which typically adopt an encoder-decoder framework consisting of two principal components: a convolutional neural network (CNN) for image feature extraction and a recurrent neural network (RNN) for language caption generation. To study the robustness of language grounding to adversarial perturbations in machine vision and perception, we propose Show-and-Fool, a novel algorithm for crafting adversarial examples in neural image captioning. The proposed algorithm provides two evaluation approaches, which check whether neural image captioning systems can be misled into outputting randomly chosen captions or keywords. Our extensive experiments show that our algorithm can successfully craft visually-similar adversarial examples with randomly targeted captions or keywords, and that these adversarial examples can be made highly transferable to other image captioning systems. Consequently, our approach leads to new robustness implications for neural image captioning and novel insights in visual language grounding.


Introduction
In recent years, language understanding grounded in machine vision and perception has made remarkable progress in natural language processing (NLP) and artificial intelligence (AI), in tasks such as image captioning and visual question answering. Image captioning is a multimodal learning task and has been used to study the interaction between language and vision models (Shekhar et al., 2017). It takes an image as input and generates a language caption that best describes its visual contents, and has many important applications, such as developing image search engines with complex natural language queries, building AI agents that can see and talk, and promoting equal web access for people who are blind or visually impaired. Modern image captioning systems typically adopt an encoder-decoder framework composed of two principal modules: a convolutional neural network (CNN) as an encoder for image feature extraction and a recurrent neural network (RNN) as a decoder for caption generation. This CNN+RNN architecture underlies popular image captioning models such as Show-and-Tell (Vinyals et al., 2015), Show-Attend-and-Tell (Xu et al., 2015) and NeuralTalk (Karpathy and Li, 2015).
Recent studies have highlighted the vulnerability of CNN-based image classifiers to adversarial examples: adversarial perturbations to benign images can be easily crafted to mislead a well-trained classifier, yielding adversarial examples that are visually indistinguishable to humans (Szegedy et al., 2014; Goodfellow et al., 2015). In this study, we investigate a more challenging problem in the visual language grounding domain: evaluating the robustness of multimodal RNNs in the form of a CNN+RNN architecture, using neural image captioning as a case study. Note that crafting adversarial examples in image captioning tasks is strictly harder than in well-studied image classification tasks, for the following reasons: (i) class attack vs. caption attack: unlike classification tasks where the class labels are well defined, the output of image captioning is a set of top-ranked captions. Simply treating different captions as distinct classes would result in an enormous number of classes that can even exceed the number of training images. In addition, semantically similar captions can be expressed in different ways and hence should not be viewed as different classes; and (ii) CNN vs. CNN+RNN: attacking RNN models is significantly less well-studied than attacking CNN models. The CNN+RNN architecture is unique and beyond the scope of adversarial examples in CNN-based image classifiers.

Figure 1: Adversarial examples crafted by Show-and-Fool using the targeted caption method. The target captioning model is Show-and-Tell (Vinyals et al., 2015), the original images are selected from the MSCOCO validation set, and the targeted captions are randomly selected from the top-1 inferred captions of other validation images.
In this paper, we tackle the aforementioned challenges by proposing a novel algorithm called Show-and-Fool. We formulate the process of crafting adversarial examples in neural image captioning systems as optimization problems with novel objective functions tailored to the CNN+RNN architecture. Specifically, our objective function is a linear combination of the distortion between benign and adversarial examples and carefully designed loss functions. The proposed Show-and-Fool algorithm provides two approaches to crafting adversarial examples in neural image captioning under different scenarios:
1. Targeted caption method: given a targeted caption, craft adversarial perturbations to any image such that its generated caption matches the targeted caption.
2. Targeted keyword method: given a set of keywords, craft adversarial perturbations to any image such that its generated caption contains the specified keywords. The captioning model has the freedom to form sentences with the targeted keywords in any order.
As an illustration, Figure 1 shows an adversarial example crafted by Show-and-Fool using the targeted caption method. The adversarial perturbations are visually imperceptible, yet they successfully mislead Show-and-Tell into generating the targeted captions. Interestingly and perhaps surprisingly, our results pinpoint the Achilles' heel of the language and vision models used in the tested image captioning systems. Moreover, the adversarial examples in neural image captioning highlight the inconsistency in visual language grounding between humans and machines, suggesting a possible weakness of current machine vision and perception machinery.

Related Work
In this section, we review the existing work on visual language grounding, with a focus on neural image captioning. We also review related work on adversarial attacks on CNN-based image classifiers; due to space limitations, we defer this second part to the supplementary material. Visual language grounding represents a family of multimodal tasks that bridge visual and natural language understanding. Typical examples include image and video captioning (Karpathy and Li, 2015; Vinyals et al., 2015; Donahue et al., 2015b; Pasunuru and Bansal, 2017; Venugopalan et al., 2015), visual dialog (Das et al., 2017; De Vries et al., 2017), visual question answering (Antol et al., 2015; Fukui et al., 2016; Lu et al., 2016; Zhu et al., 2017), visual storytelling (Huang et al., 2016), natural question generation (Mostafazadeh et al., 2017), and image generation from captions (Mansimov et al., 2016; Reed et al., 2016). In this paper, we focus on studying the robustness of neural image captioning models, and believe that the proposed method also sheds light on robustness evaluation for other visual language grounding tasks that use a similar multimodal RNN architecture.
Many image captioning methods based on deep neural networks (DNNs) adopt a multimodal RNN framework that first uses a CNN model as the encoder to extract a visual feature vector, followed by an RNN model as the decoder for caption generation. Representative works under this framework include (Chen and Zitnick, 2015; Devlin et al., 2015; Donahue et al., 2015a; Karpathy and Li, 2015; Mao et al., 2015; Vinyals et al., 2015; Xu et al., 2015; Liu et al., 2017a,b), which differ mainly in the underlying CNN and RNN architectures and in whether attention mechanisms are used. Other lines of research generate image captions using semantic information or via a compositional approach (Gan et al., 2017; Tran et al., 2016; Jia et al., 2015; You et al., 2016).
The recent work of (Shekhar et al., 2017) touched upon the robustness of neural image captioning for language grounding by showing its insensitivity to one-word (foil word) changes in the language caption, which corresponds to the untargeted attack category of adversarial examples. In this paper, we focus on the more challenging targeted attack setting, which requires fooling the captioning models into generating prespecified captions or keywords.

Overview of the Objective Functions
We now formally introduce our approaches to crafting adversarial examples for neural image captioning. The problem of finding an adversarial example for a given image I can be cast as the following optimization problem:

$\min_{\delta} \; c \cdot \text{loss}(I+\delta) + \|\delta\|_2^2, \quad \text{subject to } I+\delta \in [-1,1]^n.$ (1)

Here $\delta$ denotes the adversarial perturbation to $I$, and $\|\delta\|_2^2 = \|(I+\delta) - I\|_2^2$ is an $\ell_2$ distance metric between the original image and the adversarial image. $\text{loss}(\cdot)$ is an attack loss function that takes different forms in different attack settings; we provide the explicit expressions in Sections 3.2 and 3.3. The term $c > 0$ is a pre-specified regularization constant. Intuitively, with larger $c$, the attack is more likely to succeed, but at the price of higher distortion $\delta$. In our algorithm, we use a binary search strategy to select $c$. The box constraint $I \in [-1, 1]^n$ ensures that the adversarial example $I + \delta \in [-1, 1]^n$ lies within a valid image space.
For the purpose of efficient optimization, we convert the constrained minimization problem in (1) into an unconstrained one by introducing two new variables $y \in \mathbb{R}^n$ and $w \in \mathbb{R}^n$ such that $y = \text{arctanh}(I)$ and $w = \text{arctanh}(I+\delta) - y$, where $\text{arctanh}$ denotes the inverse hyperbolic tangent function applied element-wise. Since $\tanh(y_i + w_i) \in [-1, 1]$, the transformation automatically satisfies the box constraint. Consequently, the constrained optimization problem in (1) is equivalent to

$\min_{w \in \mathbb{R}^n} \; c \cdot \text{loss}(\tanh(w+y)) + \|\tanh(w+y) - \tanh(y)\|_2^2.$ (2)
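The change of variables above can be sketched in a few lines of NumPy (a minimal illustration, not the paper's implementation; the clipping constant `eps` is our own numerical safeguard for arctanh at the boundary):

```python
import numpy as np

def to_tanh_space(image, eps=1e-6):
    """Map an image in [-1, 1]^n to the unconstrained tanh space (y = arctanh(I))."""
    # Clip slightly inside (-1, 1) so arctanh stays finite.
    return np.arctanh(np.clip(image, -1 + eps, 1 - eps))

def to_image_space(y, w):
    """Recover a valid image from the free variable w: tanh(y + w) in [-1, 1]^n."""
    return np.tanh(y + w)

# Example: any unconstrained w yields a feasible adversarial image.
image = np.random.uniform(-1, 1, size=(8,))
y = to_tanh_space(image)
w = np.random.randn(8)          # unconstrained perturbation variable
adv = to_image_space(y, w)
assert np.all(adv >= -1) and np.all(adv <= 1)
# With w = 0 the transformation recovers the original image.
assert np.allclose(to_image_space(y, np.zeros(8)), image, atol=1e-5)
```

Since `tanh` maps any real input into (-1, 1), an optimizer such as ADAM can update `w` freely without ever leaving the valid image space.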
In the following sections, we present our designed loss functions for different attack settings.

Targeted Caption Method
Note that a targeted caption is denoted by $S := (S_1, S_2, \ldots, S_N)$, where $S_t$ indicates the index of the $t$-th word in the vocabulary list $\mathcal{V}$, $S_1$ is the start symbol and $S_N$ is the end symbol. $N$ is the length of caption $S$, which is not fixed but does not exceed a predefined maximum caption length. To encourage the neural image captioning system to output the targeted caption $S$, one needs to ensure that the log probability of the caption $S$ conditioned on the image $I+\delta$ attains the maximum value among all possible captions, that is,

$\log P(S \mid I+\delta) = \max_{S' \in \Omega} \log P(S' \mid I+\delta),$ (3)

where $\Omega$ is the set of all possible captions. It is also common to apply the chain rule to the joint probability, which gives

$\log P(S \mid I+\delta) = \sum_{t=2}^{N} \log P(S_t \mid I+\delta, S_1, \ldots, S_{t-1}).$
In neural image captioning networks, $P(S_t \mid I+\delta, S_1, \ldots, S_{t-1})$ is usually computed by an RNN/LSTM cell $f$ from its hidden state $h_{t-1}$ and input $S_{t-1}$:

$z_t = f(h_{t-1}, S_{t-1}),$ (4)

where $z_t = [z_t^{(1)}, \ldots, z_t^{(|\mathcal{V}|)}] \in \mathbb{R}^{|\mathcal{V}|}$ is a vector of the logits (unnormalized probabilities) for each possible word in the vocabulary. The vector $p_t$ represents a probability distribution on $\mathcal{V}$, with each coordinate $p_t^{(i)}$ defined following the softmax function:

$p_t^{(i)} = \exp(z_t^{(i)}) \big/ \sum_{j=1}^{|\mathcal{V}|} \exp(z_t^{(j)}).$

Intuitively, to maximize the targeted caption's probability, we can directly use its negative log probability as a loss function:

$\text{loss}_{\log\text{-prob}}(I+\delta) = -\log P(S \mid I+\delta) = -\sum_{t=2}^{N} \log p_t^{(S_t)}.$ (5)

The inputs of the RNN are the first $N-1$ words of the targeted caption $(S_1, S_2, \ldots, S_{N-1})$.
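As a minimal sketch of the log-prob loss (5) under teacher forcing, assuming the per-step logits have already been computed by the RNN (the helper names and toy numbers are ours, not the paper's):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the vocabulary axis."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def log_prob_loss(logits_per_step, target_ids):
    """Negative log probability of the targeted words S_2..S_N, given the
    per-step logits z_t obtained by feeding S_1..S_{N-1} into the RNN."""
    loss = 0.0
    for z_t, s_t in zip(logits_per_step, target_ids):
        p_t = softmax(z_t)
        loss -= np.log(p_t[s_t])
    return loss

# Toy example: vocabulary of 5 words, 3 target words.
logits = [np.array([2.0, 0.1, -1.0, 0.3, 0.0]) for _ in range(3)]
targets = [0, 0, 0]
print(log_prob_loss(logits, targets))  # small loss: word 0 dominates each step
```

Minimizing this quantity (plus the distortion term) drives the targeted words toward the top of each step's distribution.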
Applying (5) to (2), the formulation of the targeted caption method given a targeted caption $S$ is:

$\min_{w \in \mathbb{R}^n} \; c \cdot \text{loss}_{\log\text{-prob}}(\tanh(w+y)) + \|\tanh(w+y) - \tanh(y)\|_2^2.$ (6)

Alternatively, using the definition of the softmax function, (3) can be simplified to a condition on the logits. Instead of making each $z_t^{(S_t)}$ as large as possible, it is sufficient to require the target word $S_t$ to attain the largest (top-1) logit (or probability) among all words in the vocabulary at position $t$. In other words, we aim to minimize the difference between the maximum logit except $S_t$, denoted by $\max_{k \in \mathcal{V}, k \neq S_t} \{z_t^{(k)}\}$, and the logit of $S_t$, denoted by $z_t^{(S_t)}$. We also apply a ramp function on top of this difference, yielding the final loss function:

$\text{loss}_{\text{logits}}(I+\delta) = \sum_{t=2}^{N} \max\{-\epsilon, \; \max_{k \neq S_t}\{z_t^{(k)}\} - z_t^{(S_t)}\},$ (7)

where $\epsilon > 0$ is a confidence margin. We note that (Carlini and Wagner, 2017) reported that in CNN-based image classification, using logits in the attack loss function can produce better adversarial examples than using probabilities, especially when the target network deploys gradient masking schemes such as defensive distillation (Papernot et al., 2016b). Therefore, we provide both logit-based and probability-based attack loss functions for neural image captioning.
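The ramp loss (7) is straightforward to compute once the per-step logits are available. The following toy sketch (our own variable names, NumPy only) illustrates that the loss bottoms out at $-\epsilon$ per position once each target word leads every other word by more than the margin:

```python
import numpy as np

def logits_loss(logits_per_step, target_ids, eps=1.0):
    """Ramp (hinge-like) loss of Eq. (7): for each position t, penalize the gap
    between the best non-target logit and the target word's logit, floored at -eps."""
    total = 0.0
    for z_t, s_t in zip(logits_per_step, target_ids):
        z_other = np.delete(z_t, s_t)          # logits of all words except S_t
        total += max(-eps, z_other.max() - z_t[s_t])
    return total

z = [np.array([3.0, 0.5, -1.0]), np.array([0.2, 2.5, 0.1])]
# Targets are already top-1 by more than eps=1, so each term hits the -eps floor.
print(logits_loss(z, [0, 1]))  # -> -2.0
```

Once every term sits at the floor, the loss gradient vanishes and the optimizer spends the remaining budget purely on reducing the $\ell_2$ distortion.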

Targeted Keyword Method
In addition to generating an exact targeted caption by perturbing the input image, we offer an intermediate option that aims at generating captions containing specific keywords, denoted by $K := \{K_1, \cdots, K_M\} \subset \mathcal{V}$. Intuitively, finding an adversarial image whose caption contains specific keywords might seem easier than generating an exact caption, as we allow more degrees of freedom in caption generation. However, since we need to ensure a valid and meaningful inferred caption, finding an adversarial example with specific keywords in its caption is difficult from an optimization perspective. Our targeted keyword method can be used to investigate the generalization capability of a neural captioning system given only a few keywords.
In our method, we do not require a targeted keyword $K_j$, $j \in [M]$, to appear at a particular position. Instead, we want a loss function that allows $K_j$ to become the top-1 prediction (plus a confidence margin $\epsilon$) at any position. Therefore, we take the minimum of the hinge-like loss terms over all $t \in [N]$ as an indication of $K_j$ appearing at any position as the top-1 prediction, leading to the following loss function:

$\text{loss}_{K,\text{logits}}(I+\delta) = \sum_{j=1}^{M} \min_{t \in [N]} \max\{-\epsilon, \; \max_{k \neq K_j}\{z_t^{(k)}\} - z_t^{(K_j)}\}.$ (8)

We note that the loss functions in (4) and (5) require an input $S_{t-1}$ to predict $z_t$ for each $t \in \{2, \ldots, N\}$. For the targeted caption method, we use the targeted caption $S$ as the input to the RNN. In contrast, for the targeted keyword method we no longer know the exact targeted sentence, and only require the presence of specified keywords in the final caption. To bridge the gap, we use the originally inferred caption $S^0 = (S_1^0, \cdots, S_N^0)$ from the benign image as the initial RNN input. Specifically, after minimizing (8) for $T$ iterations, we run inference on $I+\delta$, set the RNN input to its current top-1 prediction, and continue this process. With this iterative optimization process, the desired keywords are expected to gradually appear in the top-1 prediction.
Another challenge that arises in the targeted keyword method is "keyword collision". When the number of keywords $M \geq 2$, more than one keyword may attain a large value of $z_t^{(K_j)}$ at the same position $t$. For example, if "dog" and "cat" are the top-2 predictions for the second word in a caption, the caption can either start with "A dog ..." or "A cat ...". In this case, despite the loss (8) being very small, a caption containing both "dog" and "cat" can hardly be generated, since only one word is allowed to appear at each position. To alleviate this problem, we define a gate function $g_{t,j}(x)$ that masks off all the other keywords when some keyword becomes top-1 at position $t$:

$g_{t,j}(x) = A$ if $\arg\max_{k \in \mathcal{V}} z_t^{(k)} \in K \setminus \{K_j\}$, and $g_{t,j}(x) = x$ otherwise,

where $A$ is a predefined value that is significantly larger than common logit values. Then (8) becomes:

$\text{loss}'_{K,\text{logits}}(I+\delta) = \sum_{j=1}^{M} \min_{t \in [N]} g_{t,j}\big(\max\{-\epsilon, \; \max_{k \neq K_j}\{z_t^{(k)}\} - z_t^{(K_j)}\}\big).$ (9)

The log-prob loss for the targeted keyword method is discussed in the Supplementary Material.
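The gated keyword loss can be sketched as follows. The exact form of the gate is reconstructed here from its verbal description (mask off positions already claimed by another targeted keyword), so treat this as an illustrative approximation rather than the paper's code; `A` plays the role of the large masking constant:

```python
import numpy as np

def keyword_loss(logits_per_step, keyword_ids, eps=1.0, A=1e6):
    """Sketch of the gated keyword loss (9): each keyword K_j may appear at any
    position t; positions where another targeted keyword is already top-1 are
    masked with the large constant A so the min skips them."""
    total = 0.0
    for k in keyword_ids:
        terms = []
        for z_t in logits_per_step:
            top1 = int(np.argmax(z_t))
            if top1 in keyword_ids and top1 != k:
                terms.append(A)                # gate: position taken by another keyword
            else:
                z_other = np.delete(z_t, k)
                terms.append(max(-eps, z_other.max() - z_t[k]))
        total += min(terms)   # keyword only needs to win at its best position
    return total

# Toy: vocabulary of 4 words, caption length 3, keywords {0, 1}.
z = [np.array([5.0, 0.0, 1.0, 0.2]),   # word 0 wins here
     np.array([0.0, 4.0, 1.0, 0.2]),   # word 1 wins here
     np.array([0.1, 0.2, 3.0, 0.2])]
print(keyword_loss(z, [0, 1]))  # both keywords at the -eps floor -> -2.0
```

The gate prevents two keywords from "competing" for the same position: each keyword's minimum is taken only over positions not already won by another keyword.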

Experimental Setup and Algorithms
We performed extensive experiments to test the effectiveness of our Show-and-Fool algorithm and study the robustness of image captioning systems under different problem settings. In our experiments, we use the pre-trained TensorFlow implementation of Show-and-Tell (Vinyals et al., 2015) with Inception-v3 as the CNN for visual feature extraction. Our testbed is the Microsoft COCO (Lin et al., 2014) (MSCOCO) data set. Although some more recent neural image captioning systems can achieve better performance than Show-and-Tell, they share a similar framework that uses a CNN for feature extraction and an RNN for caption generation, and Show-and-Tell is the vanilla version of this CNN+RNN architecture. Indeed, we find that the adversarial examples on Show-and-Tell are transferable to other image captioning models such as Show-Attend-and-Tell (Xu et al., 2015) and NeuralTalk2, suggesting that the attention mechanism and the choice of CNN and RNN architectures do not significantly affect robustness. We also note that since Show-and-Fool is, to the best of our knowledge, the first work on crafting adversarial examples for neural image captioning, there is no other method for comparison.
We use ADAM to minimize our loss functions and set the learning rate to 0.005. The number of iterations is set to 1,000. All experiments are performed on a single Nvidia GTX 1080 Ti GPU. For the targeted caption and targeted keyword methods, we perform a binary search over 5 trials to find the best $c$: initially $c = 1$, and $c$ is increased by a factor of 10 until a successful adversarial example is found. Then, we choose the new $c$ to be the average of the smallest $c$ for which an adversarial example can be found and the largest $c$ for which one cannot. We fix $\epsilon = 1$ except for the transferability experiments. For each experiment, we randomly select 1,000 images from the MSCOCO validation set. We use BLEU-1 (Papineni et al., 2002), BLEU-2, BLEU-3, BLEU-4, ROUGE (Lin, 2004) and METEOR (Lavie and Agarwal, 2005) scores to evaluate the correlation between the inferred captions and the targeted captions. These scores are widely used in the NLP community and are adopted by image captioning systems for quality assessment. Throughout this section, we use the logits losses (7) and (9). The results of using the log-prob loss (5) are similar and are reported in the supplementary material.
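The search strategy for the regularization constant $c$ described above can be sketched as follows, with `attack_succeeds` standing in (hypothetically) for a full run of the optimization at a given $c$; the tie-breaking details are our reading of the paper's description:

```python
def binary_search_c(attack_succeeds, c_init=1.0, steps=5):
    """Sketch of the strategy for choosing c: grow c by 10x until the attack
    succeeds, then average the largest failing c and the smallest succeeding c."""
    c = c_init
    c_fail, c_success = None, None
    for _ in range(steps):
        if attack_succeeds(c):
            c_success = c
            break
        c_fail = c
        c *= 10.0
    if c_success is None:
        return c            # no success within budget; return last value tried
    if c_fail is None:
        return c_success    # succeeded immediately at c_init
    return (c_fail + c_success) / 2.0

# Toy oracle: attacks succeed once c is large enough.
print(binary_search_c(lambda c: c >= 50))  # fails at 1 and 10, succeeds at 100 -> 55.0
```

Larger $c$ weights the attack loss more heavily, so success becomes monotonically easier as $c$ grows; the averaging step trades off success against distortion.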

Targeted Caption Results
Unlike the image classification task where all possible labels are predefined, the space of possible captions in a captioning system is almost infinite. However, the captioning system can only output relevant captions learned from the training set. For instance, the captioning model cannot generate a passive-voice sentence if it was never trained on such sentences. Therefore, we need to ensure that the targeted caption lies within the space of captions the captioning system can possibly generate. To address this issue, we use the generated caption of a randomly selected image (other than the image under investigation) from the MSCOCO validation set as the targeted caption S. Using a generated caption as the targeted caption excludes the effect of out-of-domain captioning, and ensures that the targeted caption is within the output space of the captioning network.
Here we use the logits loss (7) plus an $\ell_2$ distortion term (as in (2)) as our objective function. A successful adversarial example is found if the inferred caption after adding the adversarial perturbation $\delta$ is exactly the same as the targeted caption. In our setting, 1,000 ADAM iterations take about 38 seconds per image. The overall success rate and average distortion of the adversarial perturbation $\delta$ are shown in Table 1. Among all tested images, our method attains a 95.8% attack success rate. Moreover, our adversarial examples have small $\ell_2$ distortions and are visually identical to the original images, as displayed in Figure 1. We also examine the failed adversarial examples and summarize their statistics in Table 2. We find that their generated captions, albeit not entirely identical to the targeted caption, are in fact highly correlated with the desired one. Overall, the high success rate and low $\ell_2$ distortion of our adversarial examples clearly show that Show-and-Tell is not robust to targeted adversarial perturbations.

Targeted Keyword Results
In this task, we use (9) as our loss function, and choose the number of keywords $M \in \{1, 2, 3\}$. We run an inference step on $I + \delta$ every $T = 5$ iterations, and use the top-1 caption as the input to the RNN/LSTM. Similar to Section 4.2, for each image the targeted keywords are selected from the caption generated by a randomly selected validation-set image. To exclude common words like "a", "the" and "and", we look up each word in the targeted sentence and only select nouns, verbs, adjectives or adverbs. We say an adversarial image is successful when its caption contains all specified keywords. The overall success rate and average distortion are shown in Table 1. Compared to the targeted caption method, the targeted keyword method achieves an even higher success rate (at least 96% for the 3-keyword case and at least 97% for the 1-keyword and 2-keyword cases). Figure 2 shows an adversarial example crafted by our targeted keyword method with three keywords: "dog", "cat" and "frisbee". Using Show-and-Fool, the top-1 caption of a cake image becomes "A dog and a cat are playing with a frisbee" while the adversarial image remains visually indistinguishable from the original one. When M = 2 and 3, even if we cannot find an adversarial image yielding all specified keywords, we may end up with a caption that contains some of the keywords (partial success). For example, when M = 3, Table 3 shows the number of targeted keywords that appear in the captions of the failed examples (those where not all 3 targeted keywords are found). These results clearly show that the 4% of failed examples are still partially successful: their generated captions contain about 1.5 targeted keywords on average.

Transferability of Adversarial Examples
It has been shown that in image classification tasks, adversarial examples found for one machine learning model may also be effective against another model, even if the two models have different architectures (Papernot et al., 2016a; Liu et al., 2017c). However, unlike image classification where correct labels are made explicit, two different image captioning systems may generate quite different, yet semantically similar, captions for the same benign image. In image captioning, we say an adversarial example is transferable when the adversarial image found on model A with a target sentence $S_A$ can generate a similar (rather than exact) sentence $S_B$ on model B.

Figure 2: An adversarial example ($\|\delta\|_2 = 1.284$) of a cake image crafted by the Show-and-Fool targeted keyword method with three keywords: "dog", "cat" and "frisbee".
In our setting, model A is Show-and-Tell, and we choose Show-Attend-and-Tell (Xu et al., 2015) as model B. The major differences between Show-and-Tell and Show-Attend-and-Tell are the addition of attention units in the LSTM network for caption generation, and the use of the last convolutional layer (rather than the last fully-connected layer) feature maps for feature extraction. We use Inception-v3 as the CNN architecture for both models and train them on the MSCOCO 2014 data set. However, their CNN parameters are different due to the fine-tuning process.

Table: Transferability of adversarial examples from Show-and-Tell to Show-Attend-and-Tell, using different $\epsilon$ and $c$. "ori" indicates the scores between the generated captions of the original images and of the transferred adversarial images on Show-Attend-and-Tell. "tgt" indicates the scores between the targeted captions on Show-and-Tell and the generated captions of the transferred adversarial images on Show-Attend-and-Tell. A smaller "ori" or a larger "tgt" value indicates better transferability. "mis" measures the differences between captions generated by the two models given the same benign image (model mismatch). When $c = 1000$ and $\epsilon = 10$, "tgt" is close to "mis", indicating that the discrepancy between adversarial captions on the two models is mostly bounded by model mismatch, and the adversarial perturbation is highly transferable.
To measure the mismatch between Show-and-Tell and Show-Attend-and-Tell, we generate captions for the same set of 1,000 original images with both models, and report their mutual BLEU, ROUGE and METEOR scores as mis in Table 4. Small values of ori suggest that the adversarial images on Show-Attend-and-Tell generate captions that differ significantly from the original images' captions. Large values of tgt suggest that the adversarial images on Show-Attend-and-Tell generate adversarial captions similar to those on the Show-and-Tell model. We find that increasing $c$ or $\epsilon$ helps to enhance transferability at the cost of larger (but still acceptable) distortion. When $c = 1{,}000$ and $\epsilon = 10$, Show-and-Fool achieves the best transferability results: tgt is close to mis, indicating that the discrepancy between adversarial captions on the two models is mostly bounded by the intrinsic model mismatch rather than by the transferability of the adversarial perturbations, and implying that the adversarial perturbations are easily transferable. In addition, the adversarial examples generated by our method can also fool NeuralTalk2. When $c = 10^4$ and $\epsilon = 10$, the average $\ell_2$ distortion, BLEU-4 and METEOR scores between the original and transferred adversarial captions are 38.01, 0.440 and 0.473, respectively. The high transferability of adversarial examples crafted by Show-and-Fool also indicates a common robustness leakage between different neural image captioning models.
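For intuition on how these caption-similarity scores behave, here is a deliberately simplified unigram precision (BLEU-1 without the brevity penalty or the higher-order n-grams that the real metric uses; the example sentences are ours):

```python
from collections import Counter

def bleu1(candidate, reference):
    """Clipped unigram precision: the fraction of candidate words that also
    appear in the reference, with per-word counts clipped to the reference."""
    cand, ref = candidate.split(), reference.split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return overlap / max(len(cand), 1)

a = "a dog and a cat are playing with a frisbee"
b = "a dog plays with a frisbee"
print(round(bleu1(b, a), 3))  # -> 0.833
```

A high tgt score under such a metric means most words of the transferred caption already appear in the targeted caption, even when the two sentences are not identical.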

Attacking Image Captioning v.s. Attacking Image Classification
In this section we show that attacking image captioning models is inherently more challenging than attacking image classification models. In the classification task, a targeted attack usually becomes harder when the number of labels increases, since an attack method needs to change the classification prediction to a specific label over all the possible labels. In the targeted attack on image captioning, if we treat each caption as a label, we need to change the original label to a specific one over an almost infinite number of possible labels, corresponding to a nearly zero volume in the search space. This constraint forces us to develop non-trivial methods that are significantly different from the ones designed for attacking image classification models.
To verify that the two tasks are inherently different, we conducted additional experiments on attacking only the CNN module using two state-of-the-art image classification attacks on the ImageNet dataset. Our experimental setup is as follows. Each selected ImageNet image has a label corresponding to a WordNet synset ID. We randomly selected 800 images from the ImageNet dataset such that their synsets have at least one word in common with Show-and-Tell's vocabulary, while ensuring that the Inception-v3 CNN (Show-and-Tell's CNN) classifies them correctly. Then, we performed the Iterative Fast Gradient Sign Method (I-FGSM) (Kurakin et al., 2017) and Carlini and Wagner's (C&W) attack (Carlini and Wagner, 2017) on these images. The attack target labels are randomly chosen, and their synsets also have at least one word in common with Show-and-Tell's vocabulary. Both I-FGSM and C&W achieve a 100% targeted attack success rate on the Inception-v3 CNN. These adversarial examples were then used to attack the Show-and-Tell model. An attack is considered successful if any word in the targeted label's synset, or in its hypernyms up to 5 levels, is present in the resulting caption. For example, for the chain of hypernyms 'broccoli' ⇒ 'cruciferous vegetable' ⇒ 'vegetable, veggie, veg' ⇒ 'produce, green goods, green groceries, garden truck' ⇒ 'food, solid food', we include 'broccoli', 'cruciferous', 'vegetable', 'veggie' and all other following words. Note that this criterion of success is much weaker than the criterion used in our targeted caption method, since a caption containing the targeted image's hypernyms does not necessarily have a meaning similar to the targeted image's captions. To achieve higher attack success rates, we allow relatively large distortions and set $\epsilon_\infty = 0.3$ (maximum $\ell_\infty$ distortion) in I-FGSM and $\kappa = 10$, $c = 100$ in C&W.
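The hypernym-based success criterion can be sketched with a small stand-in table. In the real experiment the chain comes from WordNet; the `HYPERNYMS` dict below is a hypothetical fragment covering only the paper's broccoli example:

```python
# Hypothetical hypernym table standing in for WordNet; in the real experiment
# the chain comes from the target synset's WordNet hypernyms (up to 5 levels).
HYPERNYMS = {
    "broccoli": ["cruciferous vegetable"],
    "cruciferous vegetable": ["vegetable", "veggie", "veg"],
    "vegetable": ["produce", "green goods", "garden truck"],
    "produce": ["food", "solid food"],
}

def success_words(label, levels=5):
    """Collect the label's words plus all words from its hypernym chain."""
    words, frontier = set(label.split()), [label]
    for _ in range(levels):
        frontier = [h for term in frontier for h in HYPERNYMS.get(term, [])]
        for h in frontier:
            words.update(h.split())
    return words

def attack_succeeded(caption, label):
    """Weak success criterion: any hypernym-chain word appears in the caption."""
    return bool(success_words(label) & set(caption.lower().split()))

print(attack_succeeded("a plate of green vegetables on a table", "broccoli"))  # True
```

Even this generous criterion, applied to the CNN-only attacks, yields the low success rates reported below, underscoring the gap between attacking the CNN alone and attacking the full CNN+RNN pipeline.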
However, as shown in Table 1, the attack success rates are only 34.5% for I-FGSM and 22.4% for C&W, respectively, which are much lower than the success rates of our methods despite larger distortions. This result further confirms that performing targeted attacks on neural image captioning requires a careful design (as proposed in this paper), and attacking image captioning systems is not a trivial extension to attacking image classifiers.

Conclusion
In this paper, we proposed a novel algorithm, Show-and-Fool, for crafting adversarial examples and evaluating the robustness of neural image captioning. Our extensive experiments show that the proposed targeted caption and targeted keyword methods yield high attack success rates while the adversarial perturbations remain imperceptible to human eyes. We further demonstrated that Show-and-Fool can generate highly transferable adversarial examples. The high-quality and transferable adversarial examples in neural image captioning crafted by Show-and-Fool highlight the inconsistency in visual language grounding between humans and machines, suggesting a possible weakness of current machine vision and perception machinery. We also showed that attacking neural image captioning systems is inherently different from attacking CNN-based image classifiers.
Our method stands apart from the well-studied adversarial learning literature on image classifiers and CNN models. To the best of our knowledge, this is the first work on crafting adversarial examples for neural image captioning systems. Indeed, our Show-and-Fool algorithm can be easily extended to other applications with RNN or CNN+RNN architectures. We believe this paper provides potential means to evaluate and possibly improve the robustness (for example, via adversarial training or data augmentation) of a wide range of visual language grounding and other NLP models.