Image Caption Generation for News Articles

In this paper, we address the task of news-image captioning, which generates a description of an image given the image and its article body as input. This task is more challenging than conventional image captioning because it requires a joint understanding of image and text. We present a Transformer model that integrates the text and image modalities and attends to textual features from visual features when generating a caption. Experiments based on automatic evaluation metrics and a human evaluation show that the article text provides the primary information for reproducing news-image captions written by journalists. The results also demonstrate that the proposed model outperforms the state-of-the-art model. In addition, we confirm that visual features contribute to improving the quality of news-image captions.


Introduction
Image captioning, i.e., the automatic generation of a natural language description from an image, has received much attention from both the Computer Vision (CV) and Natural Language Processing (NLP) communities (Vinyals et al., 2015; Xu et al., 2015; Karpathy and Fei-Fei, 2015). Image captioning is important not only for practical applications (e.g., automatic indexing of images) but also as a challenge for image understanding (e.g., recognition of objects, object-object relationships, and scenes).
In this paper, we address a more advanced task, news-image captioning: given an image and the body of its article as input, this task generates a description of the image. News-image captioning differs from conventional image captioning, which receives only an image as input; in other words, news-image captioning requires a joint understanding of image and text.
Early work proposed a two-stage approach for news-image captioning (Feng and Lapata, 2013; Tariq and Foroosh, 2017): the first stage annotates a given image and text with keywords, and the second stage realizes a description based on the extracted keywords. Recent work presented end-to-end approaches that integrate image and text features in deep neural networks (Ramisa et al., 2018; Batra et al., 2018; Biten et al., 2019). However, these studies did not focus on the usefulness of text in the news-image captioning task; they merely extended conventional image captioning models to incorporate text features.

Figure 1 shows an example of a news-image caption. It may be difficult to recognize which object is central in the image, for example, the people, the violins, or the stick (bow). Moreover, the caption includes much information (e.g., Juilliard Orchestra, Vladimir Jurowsky, and Alice Tully Hall) that may be hard to infer from the image alone. This kind of example is rather common in news articles, where text is the major medium of information, and an image and its caption provide additional explanations that support the text. However, no previous work has explored a method that seamlessly integrates the article text with the image in the task of news-image captioning.
In this paper, we present a method for news-image captioning based on the Transformer (Vaswani et al., 2017), a successful architecture for various NLP tasks, including machine translation, abstractive summarization, and contextualized word embeddings. We propose a Transformer model that integrates the text and image modalities and attends to textual features from visual features when generating a caption. The experimental results show that the article text provides the primary information for reproducing news-image captions written by journalists. The results also demonstrate that the proposed model outperforms the state-of-the-art model (Biten et al., 2019). We report the results of a human evaluation and discuss remaining challenges in news-image captioning. The dataset and code used in this work are publicly available 1 .

Multimodal Transformer Model
Given an image i and a text as a sequence of n tokens (x_1, x_2, . . . , x_n), the task of news-image captioning generates a caption as a sequence of m tokens (y_1, y_2, . . . , y_m). As the base architecture for realizing the task, we use the Transformer model (Vaswani et al., 2017), which has been a popular architecture for headline generation for news articles (Takase and Okazaki, 2019; Duan et al., 2019; Dong et al., 2019). Because headline generation is a sequence-to-sequence task from (x_1, . . . , x_n) to (y_1, . . . , y_m), we consider incorporating features from an image i into the architecture.

Figure 2 illustrates the proposed model. The model consists of an image encoder, an image-article encoder, and a decoder. The image encoder converts an image i into a feature vector:

p_i = CNN(i) ∈ R^d. (1)

Here, d denotes the number of dimensions of the hidden feature vectors, and CNN(.) is a Convolutional Neural Network that converts an image into a feature vector. In this study, we compare two CNN models as CNN(.), trained on datasets for object recognition and scene recognition, respectively.

The image-article encoder computes joint representations for an input image and text. The input to the first layer of the image-article encoder is a d × n matrix whose column vectors are the sums of token embeddings and positional encodings. The image-article encoder stacks N layers of two submodules: an encoder module and an image-attending module. The encoder module is identical to the encoder component of the original Transformer model: it is a mapping from R^{d×n} to R^{d×n} using multi-head self-attention followed by a feed-forward layer.
The image-attending module is similar to the decoder component of the original Transformer model: it is a mapping from R^{d×n} to R^{d×n} using multi-head target-source attention followed by a feed-forward layer. Here, we provide the image vector p_i as the keys and values for the scaled dot-product attention so that this module can build intermediate representations by attending to the input image. The image-attending module adds the outputs of the encoder module and the target-source attention via the residual connection. Thus, we expect the target-source attention to focus on implicit contextual correlations between visual and textual features, modifying the input representations based on the input image.
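As an illustration, the image-attending module can be sketched in PyTorch as follows. This is our reconstruction rather than the released implementation; the hyperparameter values and the exact placement of layer normalization are assumptions.

```python
import torch
import torch.nn as nn


class ImageAttendingModule(nn.Module):
    """Sketch of the image-attending module: text representations attend to
    the image vector p_i via target-source attention, with residual
    connections. Hyperparameters are illustrative, not the paper's exact ones."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        # Queries come from the text side; keys/values from the image vector.
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_ffn = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, p_img: torch.Tensor) -> torch.Tensor:
        # x:     (batch, n, d) output of the preceding encoder module
        # p_img: (batch, 1, d) image feature vector from the image encoder
        attended, _ = self.attn(query=x, key=p_img, value=p_img)
        x = self.norm_attn(x + attended)    # residual add of encoder output
        x = self.norm_ffn(x + self.ffn(x))  # feed-forward sublayer
        return x
```

Stacking N pairs of an encoder module and this image-attending module would yield the image-article encoder of Figure 2.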
The decoder has M layers, each of which is identical to a layer of the original Transformer decoder. A decoder layer consists of multi-head masked self-attention over the output tokens, followed by target-source attention that receives the output of the image-article encoder as the keys and values of the scaled dot-product attention.

Experiments
The questions we want to explore in this section are twofold:

• Can the multimodal Transformer model presented in Section 2 successfully integrate visual and textual features?

• Are image captions influenced mostly by textual or by visual features in news articles?
To answer these questions, we adopt the GoodNews dataset for news-image captioning and present fixes for its issues in Section 3.1. After explaining the experimental settings (Sections 3.2 to 3.5), we report the results of the automatic evaluation (Section 3.6), followed by the human evaluation (Section 3.7).

Dataset for News-image Captioning
Biten et al. (2019) released the GoodNews dataset, the largest dataset for news-image captions so far. We considered using this dataset but found some issues. The biggest issue was that many instances in the dataset contain incomplete text; more concretely, many instances lack several leading sentences, which usually convey essential information in news articles. This issue is critical to our work, which aims to explore the use of visual and textual information in news-image captioning. In addition, Biten et al. (2019) split the dataset into training and test sets at the image level; for this reason, the same news text may appear in both the training and test sets.
To solve these issues, we crawled complete text for each news article and split the dataset randomly at the article level. Our dataset for news-image captions includes 269K articles and 489K images in total. An article contains 1.8 images on average, and 59% of articles have only one image. The average lengths of body text, headlines, and image captions of the articles are 963.29, 8.57, and 17.55 words, respectively. We split the dataset into 245K articles for the training set, 10K for the validation set, and 13K for the test set.

Data Preprocessing
Because some articles are too long to fit in GPU memory, and because the headline and leading paragraphs of an article provide the most useful information, we truncated each news article in the dataset to keep at most the first 416 words (including the headline). In addition to truncating articles, we also removed punctuation and non-ASCII characters from the text, leaving the dot '.' as a sentence delimiter. Applying the Byte-Pair-Encoding (BPE) algorithm implemented in SentencePiece 2 (Kudo and Richardson, 2018) to the preprocessed news articles of the training set, we built a vocabulary of 32,000 subwords.
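The truncation and cleaning steps can be sketched as follows (our reconstruction, not the authors' exact script); the resulting text would then be segmented into subwords with the SentencePiece BPE model:

```python
import re


def preprocess_article(headline: str, body: str, max_words: int = 416) -> str:
    """Sketch of the preprocessing described above: concatenate headline and
    body, drop non-ASCII characters, remove punctuation except the sentence
    delimiter '.', and truncate to the first max_words words."""
    text = headline + ' . ' + body
    # Drop non-ASCII characters.
    text = text.encode('ascii', errors='ignore').decode('ascii')
    # Remove punctuation, keeping '.' as the sentence delimiter.
    text = re.sub(r"[^\w\s.]", ' ', text)
    words = text.split()
    return ' '.join(words[:max_words])
```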

Baselines and Model Variants
Biten et al. (2019) presented a model that achieved the state-of-the-art performance on the GoodNews dataset. Using the publicly available implementation 3 , we trained their models on our dataset with six different settings: Avg + AttIns, Avg + CtxIns, TBB + AttIns, TBB + CtxIns, Wavg + AttIns, and Wavg + CtxIns. These settings are described in detail in Biten et al. (2019). For the sake of clarity, we briefly explain the keywords: Avg: article embeddings computed as averages of GloVe word vectors for each sentence; Wavg: article embeddings computed as weighted averages of GloVe word vectors for each sentence; TBB: article embeddings computed by the tough-to-beat baseline (Arora et al., 2019); AttIns: named entity insertion guided by the attention mechanism over the article; and CtxIns: context-based named entity insertion.
To compare the importance of visual and textual features for news-image captioning, we also prepared variants of the proposed model. In this experiment, we focused on visual features trained for two different tasks: object recognition (ImageNet) and scene recognition (Places 365). We used ResNet-18 (He et al., 2016) pre-trained on ImageNet and Places365 to obtain visual features from an image.
In addition to these variants, we also included two simple baselines: Lead and Headline regard the lead sentence and the headline of an article as its image caption, respectively. In other words, these baselines reveal the similarity between image captions and lead sentences/headlines.

Implementation and Training
We implemented all Transformer models using PyTorch (Paszke et al., 2019) based on Fairseq (Ott et al., 2019).

Training We used Adam (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, and ε = 10^−8 for parameter estimation. We linearly increased the learning rate from the initial rate of 10^−7 to 0.0005 over the first 4,000 warmup steps, and then decreased it in proportion to the inverse square root of the step number, with a minimum learning rate of 10^−9. The objective function is the cross-entropy loss with label smoothing of 0.1 (Szegedy et al., 2016). In each layer of the model, we applied dropout (Srivastava et al., 2014) with a rate of 0.3 after the layer normalization and before the residual connection. In both the encoder and decoder, we also applied dropout with a rate of 0.3 after taking the sums of token embeddings and positional encodings, and dropout with a rate of 0.1 to the attention weights. Training all models for 50 epochs, we stored the model parameters that yielded the minimum loss. It took about 1.2 days to train a Multimodal Transformer model on four NVIDIA Tesla V100 for NVLink (16 GiB HBM2) GPUs.
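The learning-rate schedule described above can be sketched as a simple function; Fairseq's built-in inverse_sqrt scheduler implements the same shape.

```python
import math


def learning_rate(step: int,
                  peak_lr: float = 0.0005,
                  init_lr: float = 1e-7,
                  warmup_steps: int = 4000,
                  min_lr: float = 1e-9) -> float:
    """Warmup-then-inverse-sqrt schedule: linear warmup from init_lr to
    peak_lr over warmup_steps, then decay proportional to the inverse
    square root of the step number, floored at min_lr."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    return max(min_lr, peak_lr * math.sqrt(warmup_steps / step))
```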

Evaluation Metrics
We used five automatic evaluation metrics: BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), ROUGE (Lin and Och, 2004), CIDEr (Vedantam et al., 2015), and SPICE (Anderson et al., 2016). Because news-image captions contain a fair number of named entities, we considered CIDEr the most appropriate and informative of the five metrics. We used the MS-COCO caption evaluation tool 6 after lower-casing the captions and removing punctuation.
Both news articles and news-image captions contain a certain number of named entities, and these entities often carry relevant contextual knowledge. We therefore introduced Coverage_NE to measure the coverage of named entities in generated captions. The coverage for a pair of generated and ground-truth captions is defined as the recall of the named entities in the ground-truth caption:

Coverage_NE = |E_generated ∩ E_gold| / |E_gold|.

Here, E_generated and E_gold denote the sets of named entities in a generated caption and its ground-truth caption, respectively. We used SpaCy 7 to identify named entities in the ground-truth captions, and applied regular expressions to find exact matches in the generated captions.


Results (Automatic Evaluation)

Table 2 reports the results of the automatic evaluation for the baseline models, the variants of Transformer-based models, and the state-of-the-art model (Biten et al., 2019) on our dataset. Multimodal Transformer (ImageNet) yields the best CIDEr score, and all Transformer-based models that use textual features substantially outperform the state-of-the-art model (Biten et al., 2019).

The Headline baseline was roughly comparable to the state-of-the-art model (Biten et al., 2019). As in the task of headline generation, the Lead method proved a strong baseline for news-image captioning. This is probably because images and captions in news articles play the role of indicative summaries, i.e., attracting readers' attention and inviting readers into the articles. In other words, journalists may write an image caption as a summary of an article, expecting the image and caption to serve as an alternative entry point to the article.
Transformer (Text) was also a strong baseline for news-image caption generation, outperforming the Transformer (Image) models even without looking at the actual images. This result indicates that news-image captions, while describing images, include much information mentioned in the text. The large difference in CIDEr scores between Transformer (Text) and Transformer (Image) implies that the vision-only models struggled with named entities.
Meanwhile, we observed improved performance for the Multimodal Transformer models compared with Transformer (Text). This result demonstrates that visual features are also useful in generating captions. Multimodal Transformer (Places 365) performed the best in BLEU, and Multimodal Transformer (ImageNet) achieved the best scores in ROUGE-L, CIDEr, and SPICE. Mixing the two types of visual features (Multimodal Transformer (ImageNet & Places 365)) yielded a slight performance deterioration. Notably, the performance differences among the Multimodal Transformer models are rather small. In Appendix A, we present a case study of captions generated by the Multimodal Transformer models and the Transformer (Text) model.

Figure 3 shows the distribution of Coverage_NE scores for each model: the x-axis presents ranges of (100 × Coverage_NE) scores, and the y-axis presents the proportion of test instances in each score range. The graph indicates that the Transformer models given the article text were able to include many more correct named entities than the other models. With no access to the article text, it is quite natural that the Transformer (Image) models could not include correct entities. The previous state-of-the-art model (Biten et al., 2019) lies somewhere between the Transformer (Image) models and the Multimodal Transformer models.
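For concreteness, the Coverage_NE computation described in the evaluation-metrics section can be sketched as follows; extracting the gold entities with spaCy is assumed to have been done beforehand.

```python
import re


def coverage_ne(generated_caption: str, gold_entities: list) -> float:
    """Recall of ground-truth named entities found verbatim in a generated
    caption (a sketch; gold_entities is assumed to come from spaCy NER)."""
    if not gold_entities:
        return 0.0  # assumption: captions without gold entities score zero
    matched = sum(
        1 for entity in gold_entities
        if re.search(r'\b' + re.escape(entity) + r'\b', generated_caption))
    return matched / len(gold_entities)
```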

Results (Human Evaluation)

Table 3: Average scores of human evaluation for three representative models.
Because we are unsure of the appropriateness of the automatic evaluation for this task, we also conducted a human evaluation. We asked three native English speakers to evaluate the captions generated by three models: Multimodal Transformer (Places 365), Transformer (Text), and Biten et al. (2019) (Avg + AttIns).
We randomly chose an evaluation set of 136 images. Each instance in this evaluation set included the image, the news article, the ground-truth caption, and the captions generated by the three models. The three generated captions were presented to the human subjects in random order so that they could not guess the quality of a caption from its order of appearance. We designed four criteria for rating generated captions: grammaticality, faithfulness, descriptiveness, and overlap.
• Descriptiveness: the caption explains the image (3); the caption does not explain the image but describes something related to it (2); the caption is totally unrelated to the image (1).
Note that human evaluation is not easy: there is no guarantee that human evaluators are familiar with the objects and scenes (e.g., people, buildings, locations) appearing in the news. Although a news article provides hints for interpreting an image, evaluators may find it hard to search for evidence about the image on the Internet. Therefore, we always presented the ground-truth captions to the human subjects to help them understand the images.

Table 3 shows the average score assigned to each model for each criterion. The two Transformer models outperformed the previous state-of-the-art model on all four criteria. In particular, Biten et al. (2019) (Avg + AttIns) suffered from a low grammaticality score. However, there was no clear winner between the two Transformer models in the human evaluation. We observed that the ranking under the overlap criterion was consistent with those of the automatic evaluation metrics used in Section 3.6; this is reasonable because the overlap criterion is a manual counterpart of the automatic metrics. Multimodal Transformer (Places 365) had an advantage in the descriptiveness criterion, but the score difference from Transformer (Text) was small. The overall low scores also imply that news-image captioning remains challenging, especially in incorporating visual information from images.

Related Work
Image Captioning Recent advances in image captioning follow the encoder-decoder architecture, where the encoder extracts visual features from images and the decoder generates the caption (Xu et al., 2015; Vinyals et al., 2015; You et al., 2016; Li et al., 2017). Notably, the attention mechanism further improved the quality of generated captions by finding implicit alignments between visual and textual features. These models achieve good performance in generating captions at a descriptive level with a consistent writing style. Unlike generic image captions, however, news-image captions are often influenced by contextual information from the news article.
News-image captioning News-image captioning has received little attention in the literature. Early work (Feng and Lapata, 2013; Tariq and Foroosh, 2017) presented two-stage models: the first stage is an annotation model that suggests keywords for an image, and the second stage realizes sentences based on the extraction results of the first stage. Recent studies departed from these approaches by utilizing deep neural networks to find implicit image-text correlations (Batra et al., 2018; Ramisa et al., 2018).
Biten et al. (2019) is the most recent method, yielding the state-of-the-art performance on the GoodNews dataset. The method consists of two stages: in the first stage, a long short-term memory (LSTM) network combines sentence-level textual features and object-level visual features to generate a template sentence; the second stage then inserts named entities into the placeholders of the template sentence to realize captions with named entities. In contrast, our method requires no template sentences and includes named entities directly from the article text based on the Transformer architecture.

Conclusion
In this paper, we presented a method for news-image captioning based on the Transformer model, which integrates the text and image modalities and attends to textual features from visual features when generating a caption. The experimental results demonstrated that the proposed model could integrate visual and textual information in generating captions. Meanwhile, we also found that, in news-image captioning as a context-oriented captioning task, the text of the news article fundamentally contributes to generating context-coherent captions, while incorporating the image further improves their quality.
In the future, we would like to explore a better approach for recognizing visual content in images to improve the quality of generated captions. We are also interested in whether a model of news-image captioning is transferable to other multimodal tasks such as multimodal translation and visual question answering.

Figure 4: Captions generated by the Transformer models. In (a), Transformer (Text) made the correct prediction for the person in the image (the prime minister of Japan). The Multimodal Transformer models injected the correct visual information (news conference) into the caption. In (b), all four models failed to generate the correct caption. Transformer (Text) predicted the correct name but the wrong contextual information. The Multimodal Transformer models generated captions with a different focus.