Ad Lingua: Text Classification Improves Symbolism Prediction in Image Advertisements

Understanding image advertisements is a challenging task that often requires non-literal interpretation. We argue that standard image-based predictions are insufficient for symbolism prediction. Following the intuition that text and images are complementary in advertising, we introduce a multimodal ensemble of a state-of-the-art image-based classifier, a classifier based on an object detection architecture, and a fine-tuned language model applied to texts extracted from ads by OCR. The resulting system establishes a new state of the art in symbolism prediction.


Introduction
Visual advertisements, both images and videos, can efficiently convey persuasive messages to potential customers. Much of the power of visual advertising comes from multiple ways of interacting with the user, embedding messages in both literal and symbolic forms. One of the most complex tasks in ad analysis is symbol interpretation, a much harder problem than, for instance, object detection. For example, in the LEE denim jeans brand ad shown in Fig. 1, a human figure is depicted with a lion's head, which might arguably symbolize "courage", "strong character", etc. In real life, lions are rarely seen wearing denim jeans, so analyzing this scene with a machine learning model trained on regular photographs would be quite challenging, at least in terms of object detection. Moreover, establishing a direct association between the (possibly detected) lion and human "courage" is a hard problem by itself. Both factors make symbol interpretation difficult. But understanding ads computationally has important applications: decoded messages can help improve ad targeting, tools for message understanding can lead to better descriptions of visual content, and they can also inform users how exactly they are being persuaded.

Hussain et al. (2017) present a crowdsourced dataset of advertisements, including images and videos. They introduce several annotation tasks: topic detection, sentiment detection, symbol recognition, strategy analysis, slogan annotation, and question answering for texts related to the ads' messages and motivation. In this work, we focus on the image-based multi-label classification task for symbols. In this problem, each annotated image in the dataset has several bounding boxes and textual descriptions that can be mapped to a limited number of symbol categories. Related work has mostly concentrated on this dataset, and it often combines text and images, but problem settings vary widely.
In particular, the ADVISE framework embeds text and images in a joint embedding space in the context of choosing human statements describing the ad's message. Zhang et al. (2018) introduce various feature extraction methods to capture relations between ad images and text. Ahuja et al. (2018) use a co-attention mechanism to align objects and symbols. Dey et al. (2018) suggest a data fusion approach to combine text and images for topic classification. Multi-modal image+text prediction is an important research field usually concerned with other tasks; for surveys of this field see, e.g., (Zhang et al., 2020; Xipeng et al., 2020).

[Figure 3: Sample texts obtained via OCR/captioning on a KFC ad, comparing Tesseract (Smith, 2007), EAST+Tesseract (Kopeykina and Savchenko, 2019), the Google Android Vision API, CloudVision (Otani et al., 2018), and an AR-Net caption. Symbols annotated by MTurkers: "Fun/Party".]

Hussain et al. (2017) remark that "reading the text might be helpful". Following this remark and the evidence provided by the winners of the Towards Automatic Understanding of Visual Advertisements challenge (a CVPR 2018 workshop; the task was to choose the correct action-reason statements for a given visual ad), we believe that the textual and visual content of an ad often complement each other to convey the message. In this work, we study how to use the text in ads for symbolism prediction along with image-based features.

Data
We have used advertisement images from the dataset by Hussain et al. (2017), annotated with the help of MTurkers. In particular, we have used the symbolism annotations. In media studies, content that stands for a conceptual symbol is called a "signifier" (in this dataset, signifiers are marked with bounding boxes on the image), and the symbols themselves are the "signified" (Williamson, 1978). Hussain et al. (2017) report that the annotators found a total of 13,938 images requiring non-literal interpretation, which was treated as an indicator that an ad contains symbolism. Following the approach of Hussain et al. (2017), we treat symbol prediction as a multi-label classification problem, using the two provided label sets with 221 and 53 labels. In the original paper, the label set was reduced from 221 to 53 via clustering, and the data was split into training and test sets in an 80%/20% ratio.

Methods
We propose a composite approach, shown in general form in Fig. 2, that combines features mined from an ad image with features mined from the text extracted from that image. In this section, we describe the methods used at each of these feature extraction steps and in the final ensembling.
Image-Based Prediction. The image-based classifier is the central part of the proposed system. We have trained several state-of-the-art convolutional architectures, namely MobileNet v1 (Howard et al., 2017), Inception v3 (Szegedy et al., 2015), and EfficientNet-B3 (Tan and Le, 2019), for the multi-label classification task with C labels as follows: (1) add a new head with C outputs with sigmoid activations and train this head for 5 epochs with the Adam optimizer (with the weights of the baseline model frozen) under the binary cross-entropy loss; (2) train the entire model for 5 epochs with the Adam optimizer; (3) fine-tune the entire model for 3 epochs with the SGD optimizer (learning rate 0.001). The best model, EfficientNet-B3, obtains an F1-score of 0.1912 (for the label set of size 221), more than 3% higher than the previous state of the art for this task (0.1579) (Hussain et al., 2017). With 53 labels (clusters of symbols), EfficientNet-B3 with the same training procedure obtains an F1-score of 0.2774, also exceeding previous results (0.2684) (Hussain et al., 2017). Thus, even at this level we already exceed the state of the art by using better convolutional backbones and tuned training schedules.
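The F1 scores above are computed for multi-label predictions over per-label sigmoid outputs; a minimal sketch of the micro-averaged variant of this metric in plain Python (the 0.5 decision threshold here is an illustrative assumption, not a value stated in the text):

```python
def micro_f1(y_true, y_score, threshold=0.5):
    """Micro-averaged F1 for multi-label predictions.

    y_true:  list of sets of gold label indices, one set per image
    y_score: list of dicts {label_index: sigmoid output}, one per image
    """
    tp = fp = fn = 0
    for gold, scores in zip(y_true, y_score):
        # Threshold each label's sigmoid output independently.
        pred = {label for label, s in scores.items() if s >= threshold}
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Micro-averaging pools true/false positives over all labels before computing precision and recall, so frequent symbols dominate the score.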
Object Detection. We have also tried an approach that uses the locations of signifiers (symbols) on input images to improve symbol recognition. We used the Faster R-CNN model (Ren et al., 2015) with an InceptionResNet backbone trained on OpenImages v4 with 601 categories (Kuznetsova et al., 2018). Objects were detected in both training and validation sets, and we retained only those objects from the training set that intersect with given symbols with IoU (Intersection over Union) over 0.6 (a fixed threshold); these objects were cropped and put in correspondence with symbol labels. We solved multi-label classification for symbol recognition by feeding the images of these objects into pre-trained CNN feature extractors, namely MobileNets v1/v2 and EfficientNets B0/B3 (Tan and Le, 2019), with a single dense layer on top for classification. The final decision for a test image was made as follows: (1) detect objects with the same Faster R-CNN; (2) extract their visual features with pre-trained CNNs; (3) classify the features with the shallow network; (4) unite its outputs (scores or predicted symbol posterior probabilities) with non-maximum suppression and return only symbols with scores exceeding a given threshold. The F1 scores of this pipeline are lower than those of the previous approach. However, the two methods are arguably very different, so joining both in a single ensemble might improve the results (see below).
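The training-set filtering step above keeps only detected objects that overlap an annotated signifier box with IoU above the fixed 0.6 threshold. A minimal sketch of that criterion, assuming boxes in (x1, y1, x2, y2) format and simple exhaustive matching (function names are ours):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def match_objects_to_symbols(detections, symbol_boxes, threshold=0.6):
    """Keep (object, symbol label) pairs whose boxes overlap with IoU > threshold.

    detections:   list of (box, object crop id) from the detector
    symbol_boxes: list of (box, symbol label) from the annotations
    """
    return [(obj, sym_label)
            for obj_box, obj in detections
            for sym_box, sym_label in symbol_boxes
            if iou(obj_box, sym_box) > threshold]
```

The matched crops then serve as training examples for the shallow classifier on top of the CNN feature extractors.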
Image-to-text: optical character recognition and captioning. We test our main hypothesis (that text is complementary to visual content) by extracting the text with the following OCR techniques: (1) the open-source OCR solution Tesseract (Smith, 2007); (2) the text processing approach of Kopeykina and Savchenko (2019), in which the bounding boxes of text regions are obtained with the EAST text detector (Zhou et al., 2017) and the words in each region are recognized with the Tesseract OCR engine without its own text detection ("-oem 1 -psm 13"); (3) the Google text recognition library (Google Vision API for Android) available at com.google.android.gms.vision.text, following the approach of Myasnikov and Savchenko (2019); (4) OCR data from CloudVision used by Otani et al. (2018).
Our experiments indicate that OCR quality has a very significant impact on the output of the predictive system. Unfortunately, the Tesseract text detector proved to be far from accurate on the advertisement dataset. It successfully detected text on only about half of the images (1182 out of 2084 validation images), far fewer than commercial solutions: 1739 detected by the Google CloudVision API and 1894 provided by Otani et al. (2018). The EAST+Tesseract approach extracted texts from 1943 validation images, but low recognition quality and arguably random text block order led to inferior prediction results; see Fig. 3 for an illustration. The best-performing texts were those obtained from Otani et al. (2018).
In a different take on extracting text, instead of OCR we have tried to predict symbols based on text obtained via image captioning. The task of image captioning requires producing textual descriptions that not only express the content of the input image but are also naturally coherent. We have chosen this approach for our experiments as a completely different way to obtain texts related to the images of interest (Savchenko and Miasnikov, 2020). We have used one of the best-performing image captioning models, AR-Net (Chen et al., 2018), pre-trained on Google's Conceptual Captions (Sharma et al., 2018). AR-Net achieves a notable improvement in captioning due to a specific regularization strategy: reconstructing the previous RNN state allows gradient and state information to propagate through time more robustly. We note that the idea of combining captions and recognized texts is potentially fruitful because these texts are usually very different. The DenseCap approach (Johnson et al., 2016), which extracts captioned bounding boxes from an image, might be especially promising. We leave this idea for further study.
Text-based models. We have applied several text classification techniques to OCR results to establish new baselines, training BERT-based multi-label classification models with the Simple Transformers library (Rajapakse, 2020), which builds on the Transformers library (Wolf et al., 2019). We have compared three architectures: (1) BERT (Devlin et al., 2018), using bert-base-uncased from Wolf et al. (2019); (2) RoBERTa (Liu et al., 2019), using roberta-base from Wolf et al. (2019); (3) a Bag-of-Ngrams baseline that tokenizes extracted texts and preserves only the 10,000 most frequent unigrams and bigrams. As the multi-label classification model for this baseline we have used logistic regression (Pedregosa et al., 2011), one classifier per label. We have also experimented with multi-output logistic regression trained on SGNS (Mikolov et al., 2013a; Mikolov et al., 2013b) and fastText (Bojanowski et al., 2016) representations, but the prediction quality was clearly worse. In preprocessing, we filtered out training items without recognized text and lowercased the text. In the experiments, models were trained in the following setting: 15 training epochs, batch size 16, learning rate 4e-5, and other parameters at default values from Rajapakse (2020). All text-based models perform clearly worse than image-based ones; this is natural because text is not always present, is often short, and even the best OCR methods make quite a few mistakes. However, combining image- and text-based approaches can yield significant improvements.
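The Bag-of-Ngrams baseline above can be sketched as follows: lowercase the text, collect unigram and bigram counts, and keep the 10,000 most frequent n-grams as features for per-label logistic regression. The whitespace tokenization and binary feature weighting shown here are illustrative assumptions (the exact scheme is not specified above):

```python
from collections import Counter

def ngram_vocabulary(texts, max_features=10_000):
    """Collect unigram and bigram counts over lowercased texts and
    keep the max_features most frequent n-grams."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(tokens)                     # unigrams
        counts.update(zip(tokens, tokens[1:]))    # bigrams as token pairs
    return [ngram for ngram, _ in counts.most_common(max_features)]

def vectorize(text, vocab):
    """Binary bag-of-ngrams feature vector for one text; a one-vs-rest
    logistic regression is then trained per label on these vectors."""
    tokens = text.lower().split()
    present = set(tokens) | set(zip(tokens, tokens[1:]))
    return [1 if ngram in present else 0 for ngram in vocab]
```

Such a fixed sparse representation gives the linear baseline a fair chance against the fine-tuned transformers on short, noisy OCR output.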
Ensemble. Due to the high risk of overfitting on a small dataset, we have chosen a simple weighted sum of image-based, object detection, and text-based predictions as the ensembling strategy. For an input ad a, each model in the ensemble yields a vector of predictions f(a) ∈ [0, 1]^{|L|}, where L is the set of labels (classes), and the resulting ensemble outputs 1 if λ_img f_img(a) + λ_obj f_obj(a) + λ_txt f_txt(a) > θ, where f_*(a) ∈ [0, 1]^{|L|} are the predictions of the individual models described above, and the coefficients λ_*, with Σ_* λ_* = 1, and the threshold θ are tunable parameters. For tuning, we have used the predictions of all three "elementary models" (image-, object-, and text-based) on both training and test sets, sampled (5 times) a fraction of the training set (0.1 and 0.05 for 221 and 53 labels, respectively; we use the training set due to the small dataset size), then sampled λ_img, λ_obj, and λ_txt from a Dirichlet distribution and evaluated the macro-averaged F1 (not micro, as an extra measure against overfitting) on the chosen training set subsample for every θ from the set {0.0, 0.05, …, 1.0}. Then we averaged the 5 resulting sets of parameters. In order to compare against image-only baselines on the whole test set, we also trained a similar blend of image- and object-based classifiers (2 models) and used it as a backoff model. Table 1 shows the results in terms of micro-averaged F1 scores on the test set; text-based and ensemble models are evaluated separately on images where OCR detects something (w/text) and on all ads, falling back to the 2-model ensemble when OCR does not produce anything (all, with backoff).
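The blending rule and the random search over (λ_img, λ_obj, λ_txt, θ) can be sketched in plain Python. The gamma-based Dirichlet sampler is a standard construction, and score_fn stands in for the macro-F1 evaluation on a training subsample (its name and the sample counts below are illustrative assumptions):

```python
import random

def ensemble_predict(f_img, f_obj, f_txt, lambdas, theta):
    """Per-label decision: 1 iff the weighted sum of model scores exceeds theta."""
    l_img, l_obj, l_txt = lambdas
    return [int(l_img * i + l_obj * o + l_txt * t > theta)
            for i, o, t in zip(f_img, f_obj, f_txt)]

def sample_dirichlet(k, rng=random):
    """Sample k nonnegative weights summing to 1 via normalized Gamma(1, 1) draws."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def tune(score_fn, thetas=None, n_samples=100):
    """Random search: sample lambdas from a Dirichlet, sweep theta over a grid,
    and keep the best-scoring (lambdas, theta) pair."""
    if thetas is None:
        thetas = [i * 0.05 for i in range(21)]   # 0.0, 0.05, ..., 1.0
    best = (None, None, float("-inf"))
    for _ in range(n_samples):
        lambdas = sample_dirichlet(3)
        for theta in thetas:
            s = score_fn(lambdas, theta)
            if s > best[2]:
                best = (lambdas, theta, s)
    return best[:2]
```

Repeating this search on several training subsamples and averaging the resulting parameters, as described above, further guards against overfitting the blend weights.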

Results and Discussion
We see that Bag-of-Ngrams is a strong baseline for this task, in some cases even outperforming fine-tuned BERT and RoBERTa. The results confirm our main conclusion: while text-based models alone are hopelessly outmatched even on images with OCR-detected text, they do add a significant boost to image-based models when ensembled with them. Another important point is the importance of OCR quality: significant gains are achieved only with the best OCR techniques, while adding text recognized by basic Tesseract fails to achieve meaningful improvements. In general, Table 1 shows that we have significantly improved on state-of-the-art results in the symbolism prediction task for both label sets. Table 2 shows an error analysis for sample symbols predicted by three models. First, models predict related symbols when the difference between them is vague (Examples 2 and 3). Second, advertisers encode general knowledge in images; e.g., a dog might be a metaphorical representation of safety (Example 1). Finally, ads often contain short text fragments that do not describe the picture in detail; e.g., the text says that bike helmets prevent head injury, while the image shows a person in a hospital (Example 2).

Conclusion
We have presented a novel approach to symbol classification on multimodal advertisement data, improving upon state-of-the-art results already with pure image-based approaches and showing further improvements with text-based methods. We have introduced linear ensembles of the developed image-based models and text-based models that operate on OCR-extracted text, demonstrating superior performance for the ensembles and significant improvements over the state of the art for both label sets. Possible directions for future work include: (1) improving OCR postprocessing: spelling correction, noise removal, etc.; (2) enhancing text-based prediction and/or object detection with thesauri and knowledge graphs, e.g., replacing specific entities with hypernyms similarly to Ilharco et al. (2019), or with word association datasets, e.g., Wordgame (Louwe, 2020); (3) developing a joint architecture for images and texts: obviously, simple blending might not be the best choice for the task.