Abstractive Text-Image Summarization Using Multi-Modal Attentional Hierarchical RNN

Rapid growth of multi-modal documents on the Internet makes multi-modal summarization research necessary. Most previous research summarizes texts or images separately. Recent neural summarization research shows the strength of the Encoder-Decoder model in text summarization. This paper proposes an abstractive text-image summarization model using the attentional hierarchical Encoder-Decoder model to summarize a text document and its accompanying images simultaneously, and then to align the sentences and images in summaries. A multi-modal attentional mechanism is proposed to attend original sentences, images, and captions when decoding. The DailyMail dataset is extended by collecting images and captions from the Web. Experiments show our model outperforms the neural abstractive and extractive text summarization methods that do not consider images. In addition, our model can generate informative summaries of images.


Introduction
Summarizing multi-modal documents to get multi-modal summaries is becoming an urgent need with rapid growth of multi-modal documents on the Internet. Text-Image summarization is to summarize a document with text and images to generate a summary with text and images. The summarization approach is different from pure text summarization. It is also different from image summarization which summarizes an image set to get a subset of images.
An image is worth thousands of words (Rossiter, et al., 2012). Images play an important role in information transmission. Incorporating images into text to generate text-image summaries can help people better understand, memorize, and express information. Most recent research focuses on pure text summarization or image summarization. Little has been done on text-image summarization. Figure 1 and Figure 2 show an example of text-image summarization. Figure 1 is the original multi-modal news with text and images. The news has 17 sentences (with 322 words) and 4 images, each of which has a caption. Figure 2 is the manually generated multi-modal summary. In the summary, the news is distilled to 3 sentences (with 36 words) and 2 images, and each summary sentence is aligned with an image.
To generate such a text-image summary, the following problems should be considered: How to generate the text part? How to measure the importance of images, and extract important images to form the image summary? How to align sentences with images?
In this paper, we propose a neural text-image summarization model based on the attentional hierarchical Encoder-Decoder model to solve the above problems. The attentional Encoder-Decoder model has been successfully used in sequence-to-sequence applications such as machine translation, text summarization (Cheng and Lapata, 2016; Tan et al., 2017), image captioning (Liu et al., 2017a), and machine reading comprehension (Cui et al., 2016).
At the encoding stage, we use the hierarchical bi-directional RNN to encode the sentences and the text document, and use the CNN and the RNN to encode the image set. At the decoding stage, we combine the text encoding and the image encoding as the initial state, and use the attentional hierarchical decoder, which attends original sentences, images and captions, to generate the text summary. Each generated sentence is aligned with a sentence, an image, or a caption in the original document. Based on the alignment scores, images are selected and aligned with the generated sentences. At the inference stage, we adopt the multi-modal beam search algorithm, which scores beams based on the bigram overlaps of the generated sentences and the attended captions.
The main contributions are as follows:
1) We propose the text-image summarization task, and extend the standard DailyMail corpora by collecting the images and captions of each news article from the Web for the task.
2) We propose an RNN model to encode the ordered image set of the multi-modal document as one of the initial states (the other is the text encoding) of the decoder.
3) We propose three multi-modal attentional mechanisms which attend the text and the images simultaneously when decoding.
4) Experiments show that attending images when decoding can improve text summarization, and that our model can generate informative image summaries.

Related Work
Recent research on text summarization focuses on neural methods. The attentional Encoder-Decoder model was first proposed to align the original text and the translated text in machine translation (Luong et al., 2015). The attention model was applied to sentence summarization by considering the neural language model and the attention model when generating the next words (Rush et al., 2015). Zhou et al. (2017) propose a selective Encoder-Decoder model for sentence summarization that uses a selective gate network to control the information flowing from the encoder to the decoder.
Cheng and Lapata (2016) propose a neural document summarization model that extracts sentences and words. They use a CNN model to encode sentences, and then use an RNN model to encode documents. The model extracts sentences by computing the probability that each sentence belongs to the summary based on an RNN model. It extracts words from the original document based on an attentional decoder.
SummaRuNNer, an RNN-based extractive summarization model that treats summarization as a sentence classification problem, is proposed in (Nallapati et al., 2016). A logistic classifier is applied using features computed by the RNN model. A hierarchical Encoder-Decoder model that conserves the hierarchical structure of documents is proposed in (Li et al., 2015). A graph-based attentional Encoder-Decoder model that uses a PageRank algorithm to compute the attention is proposed in (Tan et al., 2017).
Image captioning generates a caption for an image. Text-image summarization is similar to image captioning in that both utilize image information to generate text. Images are encoded with CNN models such as VGGNet (Simonyan and Zisserman, 2014), AlexNet (Krizhevsky et al., 2012) and GoogleNet (Szegedy et al., 2014) by extracting the last fully-connected layers. An attentional model is used in image captioning by splitting an image into multiple parts which are attended in the decoding process (Xu et al., 2015). Image tags were used as additional information, and a semantic attention model which attends image tags when decoding was proposed (You et al., 2016). The attention-based alignment of image parts and text is studied in (Liu et al., 2017a), and the results show that the alignments are in high accordance with manual alignments. In (Liu et al., 2017b), an image is encoded as an ordered set of recognized objects, and an attentional decoder is applied to generate captions.
Multi-modal summarization summarizes text, images, videos, etc. It is an important branch of automatic summarization. Traditional multi-modal summarization takes multi-modal documents or pure text documents as input, and outputs multi-modal documents (Wu, 2011; Greenbacker, 2011; Yan, 2012; Agrawal, 2011; Zhu, 2007; UzZaman, 2011). For example, Yan et al. (2012) generate multi-modal timeline summaries for news sets by constructing a bi-graph between text and images and applying a heterogeneous reinforcement ranking algorithm. Strategies for summarizing texts with images and the notion of summarization of things are proposed in (Zhuge, 2016). A related deep learning work treats text summarization as a sentence recommendation task and applies a matrix factorization algorithm. It first retrieves images from Yahoo!, uses a CNN to extract image features as additional information for the sentences, and uses ROUGE maximization as the training objective, which is optimized with SGD. At test time, sentences are extracted based on the model and images are retrieved from the search engine.
Figure 3 shows the framework of our model, a multi-modal attentional hierarchical encoder-decoder model. The hierarchical encoder-decoder is proposed in (Li et al., 2015) and extended by (Tan et al., 2017) for document summarization by bringing in the graph-based attentional model.

Method
Our model consists of three parts: a hierarchical RNN to encode the original sentences and the captions, a CNN+RNN encoder to encode the image set, and a multi-modal attentional hierarchical RNN decoder.
The input of our model is a multi-modal document MD = {D, PicSet}, where D is the main text of the multi-modal document and PicSet is the image-caption set ordered by the occurring order of images in the document.
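The input structure described above can be pictured as a small container type. The following is only an illustrative sketch: the class names, the `image_path` field, and the sample data are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Picture:
    """One image together with its caption (hypothetical container)."""
    image_path: str  # illustrative location of the image file
    caption: str     # caption text attached to the image


@dataclass
class MultiModalDocument:
    """MD = {D, PicSet}: the main text plus the ordered image-caption set."""
    sentences: List[str]                                  # D, in reading order
    pic_set: List[Picture] = field(default_factory=list)  # ordered by occurrence

    def captions(self) -> List[str]:
        """Return the ordered caption set (later treated as a caption document)."""
        return [p.caption for p in self.pic_set]


md = MultiModalDocument(
    sentences=["First sentence .", "Second sentence ."],
    pic_set=[
        Picture("img1.jpg", "A caption ."),
        Picture("img2.jpg", "Another caption ."),
    ],
)
```

Keeping `pic_set` as an ordered list reflects the statement that images are ordered by where they occur in the document.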

Main Text Encoder
The main text D consists of sentences, each of which consists of words. Let $D = [s_1, s_2, \ldots, s_{|D|}]$ and $s_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,|s_i|}]$, where $x_{i,j}$ is the embedding of the $j$-th word in $s_i$. We use word2vec (Mikolov et al., 2013) to create word embeddings. GRU is used as the RNN cell. We use a hierarchical RNN encoder to encode the main text D into a vector representation.
The sentence encoder encodes sentences into vector representations. An <eos> token is appended to the end of each sentence. A bidirectional RNN is used as the sentence encoder, where $enc^{sent}_i$ denotes the vector representation of $s_i$. It is the concatenation of the final hidden states of the forward and backward passes.
We use $enc^{sent}_i$ as the inputs to the document encoder, which encodes the main text into a vector representation. A bi-directional RNN is adopted as the document encoder, where $enc^{doc}$ denotes the vector representation of D, and $h_i$ is the concatenated hidden state of $s_i$.
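The numbered encoder equations did not survive in this text. For orientation only, a standard bidirectional GRU formulation that matches the description would read as follows. The arrow notation, the index conventions, and the choice of which states are concatenated are assumptions, not the paper's own equations.

```latex
\begin{aligned}
\overrightarrow{h}_{i,j} &= \mathrm{GRU}\big(x_{i,j},\ \overrightarrow{h}_{i,j-1}\big), &
\overleftarrow{h}_{i,j} &= \mathrm{GRU}\big(x_{i,j},\ \overleftarrow{h}_{i,j+1}\big), \\
enc^{sent}_i &= \big[\overrightarrow{h}_{i,|s_i|};\ \overleftarrow{h}_{i,1}\big], \\
\overrightarrow{h}_{i} &= \mathrm{GRU}\big(enc^{sent}_i,\ \overrightarrow{h}_{i-1}\big), &
\overleftarrow{h}_{i} &= \mathrm{GRU}\big(enc^{sent}_i,\ \overleftarrow{h}_{i+1}\big), \\
h_i &= \big[\overrightarrow{h}_i;\ \overleftarrow{h}_i\big], &
enc^{doc} &= \big[\overrightarrow{h}_{|D|};\ \overleftarrow{h}_{1}\big].
\end{aligned}
```

The first two lines are the word-level pass that produces one vector per sentence; the last two lines repeat the same pattern over sentence vectors to produce the per-sentence states and the document vector.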

CaptionSet and ImageSet Encoder
The ordered image-caption set PicSet consists of an ordered image set and an ordered caption set, both ordered by their occurrence in the multi-modal document. The image occurring order makes sense because images are often put near the most related sentences, and the sentences have a strict order in the document. We treat the ordered caption set as a document, and apply the sentence encoder and the document encoder to the caption document. Then, we get the hidden state $h^{cap}_i$ and the vector representation $enc^{cap}$ of the caption document.
We use the CNN model to extract the vector representation of each image, and then use the RNN model to encode the ordered image set into a vector representation. The CNN model we adopt is the 19-layer VGGNet (Simonyan and Zisserman, 2014). We drop the last dropout layer and keep the last fully-connected layer as the image's vector representation, the dimension of which is 4096.
We then use a bi-directional RNN model to encode the ordered image set, with the image features as the inputs of the RNN model. Here $img\_fea_i$ is the vector representation of $img_i$, $enc^{img}$ is the vector representation of the image set, and $h^{img}_i$ is the hidden state of $img_i$ when encoding the image set.
To the best of our knowledge, we are the first to adopt the RNN model to encode the image set.

Decoder
In the decoding stage, we adopt the hierarchical RNN decoder to generate text summaries. Equations (12) to (16) define the hierarchical decoder, which consists of a sentence decoder and a word decoder.
Equation (12) computes the initial state $h'_0$ for the sentence decoder by combining the encoding of the main text and the encoding of the image information of the multi-modal document. To represent the image information, we can use both the image set encoding and the caption set encoding, or only one of them, depending on the multi-modal attention mechanism introduced in the next subsection.
The sentence decoder uses a two-level hidden output model to generate the representation of the next sentence through equation (13) and equation (14). The two-level hidden output model consistently improves the summarization performance on different datasets. The word decoder uses the sentence representation generated by the sentence decoder as the initial state, and uses the <sos> (start of sentence) token as the initial input. Equation (15) and equation (16) generate the next hidden state and the next word. The output of the word decoder in the first step is a switch sign which is either the <neod> token or the <eod> token. The token <neod> means "not end of document", and the token <eod> means "end of document". If the first output is <eod>, the whole decoding process is finished. If the first output is <neod>, the token is used as the next input of the word decoder. The word decoding process is finished when it generates the <eos> token. The last hidden state of the word decoder is treated as the vector representation of the generated sentence and is used as the next input of the sentence decoder.
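The control flow of the switch tokens can be sketched as a pair of nested loops. This is a minimal illustration under stated assumptions: `decode_document`, `sent_step`, and `word_step` are hypothetical names, the step functions stand in for the GRU updates of equations (13) to (16), the first sentence-decoder input is assumed to be empty (`None`), and the scripted toy decoder exists only to make the example runnable.

```python
from typing import Any, Callable, List, Optional, Tuple

SOS, EOS, NEOD, EOD = "<sos>", "<eos>", "<neod>", "<eod>"


def decode_document(
    h0: Any,
    sent_step: Callable[[Any, Optional[Any]], Any],
    word_step: Callable[[Any, str], Tuple[Any, str]],
    max_sents: int = 10,
    max_words: int = 50,
) -> List[List[str]]:
    """Greedy hierarchical decoding driven by <neod>/<eod>/<eos>."""
    summary: List[List[str]] = []
    sent_state, sent_input = h0, None  # assumption: no input at the first step
    for _ in range(max_sents):
        # Sentence decoder produces the representation of the next sentence.
        sent_state = sent_step(sent_state, sent_input)
        # First word-decoder output is the switch sign.
        word_state, token = word_step(sent_state, SOS)
        if token == EOD:
            break  # end of document: stop decoding entirely
        words: List[str] = []
        for _ in range(max_words):
            # <neod> (then each generated word) is fed back as the next input.
            word_state, token = word_step(word_state, token)
            if token == EOS:
                break
            words.append(token)
        summary.append(words)
        # Last word-decoder state represents the generated sentence.
        sent_input = word_state
    return summary


# Toy scripted decoder: a state is (sentence index, position in the script).
SCRIPT = [[NEOD, "a", "b", EOS], [NEOD, "c", EOS], [EOD]]


def toy_sent_step(state, prev_sentence_vec):
    return (state[0] + 1, 0)


def toy_word_step(state, token):
    k, pos = state
    return (k, pos + 1), SCRIPT[k][pos]


summary = decode_document((-1, 0), toy_sent_step, toy_word_step)
```

With the scripted decoder, two sentences are produced before the third sentence's first output is `<eod>`, which ends decoding.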

Multi-Modal Attention
We propose three multi-modal attention mechanisms to compute the sentence decoding context $c_i$.
Traditional attention mechanisms for text summarization compute the importance score of the sentence $s_j$ in the original document based on the relationship between the decoding hidden state $h'_i$ and the original sentence encoding hidden state $h_j$. We call the traditional attention model Text Attention (attT for short), which is computed by equations (17), (18) and (19), where $\alpha_{i,j}$ is the normalized attention and $c_i$ is the context. The problem is that the multi-modal document has images and captions besides the main text. Therefore, we propose three multi-modal attention mechanisms which take images and captions into consideration.
Text-Image Attention (attTI for short). This attention model attends the images directly and neglects the captions.
Text-Caption Attention (attTC for short). This attention model uses captions to represent the image information. attTC computes the attention score of the caption $cap_j$ based on the relationship between the caption encoding hidden state $h^{cap}_j$ and the decoding hidden state $h'_i$.
Text-Image-Caption Attention (attTIC for short). This attention model uses both captions and images to represent the image information. attTIC computes the importance score of the caption $cap_j$ and the importance score of the image $img_j$ simultaneously, and then computes the context of the decoding hidden state $h'_i$ using equation (25).
The initial state of the decoder is computed by equation (12), which can be adjusted according to the different attention models.
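The idea of attending several modalities at once can be illustrated with a single softmax taken jointly over text and image hidden states. This is a rough sketch, not the paper's equations: the dot-product score, the function names, and the requirement that all states share one dimension are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multimodal_attention(
    dec_state: Vector,
    text_states: List[Vector],
    image_states: List[Vector],
) -> Tuple[List[float], List[float], List[float]]:
    """Attend sentences and images jointly; return both weight lists and the context."""
    memory = list(text_states) + list(image_states)
    # Assumed score: dot product between decoder state and each encoder state.
    weights = softmax([dot(dec_state, h) for h in memory])
    context = [
        sum(w * h[d] for w, h in zip(weights, memory))
        for d in range(len(dec_state))
    ]
    n_text = len(text_states)
    return weights[:n_text], weights[n_text:], context


text_w, img_w, ctx = multimodal_attention(
    dec_state=[1.0, 0.0],
    text_states=[[1.0, 0.0], [0.0, 1.0]],
    image_states=[[1.0, 0.0]],
)
```

Because the normalization runs over sentences and images together, the image weights for one decoding step sum to less than 1, with the remainder assigned to the sentences.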

Model Training
Since there are no existing manual text-image summaries, and most of the existing training and testing data have pure text summaries, we decide to use pure text summaries as training data to train our models. The sentence-image alignment relationships can be discovered through training the multi-modal attention models.
The loss function L of our summarization model is the negative log likelihood of generating text summaries over the training multi-modal document set MDS:
$$L = -\sum_{(D,\, PicSet,\, Y) \in MDS} \log p(Y \mid D, PicSet)$$
where $Y = [y_1, y_2, \ldots, y_{|Y|}]$ is the word sequence of the summary corresponding to the main text D and the ordered image set PicSet, including the tokens <eos>, <neod> and <eod>. We use the Adam (Kingma and Ba, 2014) gradient-based optimization method to optimize the model parameters.

Multi-Modal Beam Search Algorithm
There are two major problems of the generation of summaries: one is the out-of-vocabulary problem, and the other is the low quality of the generated texts including information incorrectness and repetitions.
For the OOV problem, we use the words in the attended sentences or captions in the original document to replace OOV tokens in the generated summary. Previous research uses the attended words to replace OOV tokens in the flat encoder-decoder model, which attends the words of the original word sequence (Jean, et al., 2015). Our model is hierarchical and multi-modal, and attends sentences, images, and captions when decoding. We use the following algorithm to find the replacement for the $j$-th OOV in a generated sentence:
Step 1: Order the original sentences and captions by the attending scores in descending order.
Step 2: Return the $j$-th OOV word in the ordered sentences and captions as the replacement.
For the attTI mechanism that attends images neglecting captions, we use captions instead of the attended images in the algorithm.
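The two steps above can be sketched in a few lines. The function name, the zero-based index `j`, and the reading of "OOV word" as any word outside the model vocabulary are assumptions made for this illustration.

```python
from typing import List, Optional, Sequence, Set


def find_oov_replacement(
    j: int,
    texts: Sequence[List[str]],
    attention: Sequence[float],
    vocab: Set[str],
) -> Optional[str]:
    """Return the j-th (zero-based) out-of-vocabulary word found when scanning
    the original sentences and captions from most to least attended."""
    # Step 1: order sentences and captions by attention score, descending.
    order = sorted(range(len(texts)), key=lambda k: attention[k], reverse=True)
    # Step 2: walk the ordered texts and pick the j-th OOV word.
    seen = 0
    for k in order:
        for word in texts[k]:
            if word not in vocab:
                if seen == j:
                    return word
                seen += 1
    return None  # fewer than j + 1 OOV words are available


texts = [["the", "cat"], ["obama", "met", "merkel"], ["a", "zebra"]]
attention = [0.1, 0.7, 0.2]
vocab = {"the", "cat", "met", "a"}
first = find_oov_replacement(0, texts, attention, vocab)
```

For attTI, the caption of each attended image would take the place of the image in `texts`, as the paragraph above describes.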
For the low-quality generated text problem, we adopt the hierarchical beam search algorithm (Tan et al., 2017). We extend the algorithm by adding caption-level and image-level beam search. The multi-modal hierarchical beam search algorithm comprises K-best word-level beam search and N-best sentence-caption-level beam search. In particular, for the attTI mechanism, which attends images, we use the corresponding captions instead of the images in the beam search algorithm.
In the word-level search, we compute the score of generating the word $y_t$ using equation (28), where ref is a function calculating the ratio of bigram overlap between two texts, $s^*$ is the attended sentence or caption, and γ is the weighting factor. The added term aims to increase the overlap between the generated summary and the original text.
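Since equation (28) is not reproduced here, the following is only a plausible sketch of such a score: the model's log-probability plus γ times a bigram-overlap ratio. The additive form, the use of log-probabilities, and the normalization of `ref` by the number of generated bigrams are all assumptions.

```python
import math
from typing import List, Set, Tuple


def bigrams(tokens: List[str]) -> Set[Tuple[str, str]]:
    return set(zip(tokens, tokens[1:]))


def ref(generated: List[str], attended: List[str]) -> float:
    """Fraction of the generated bigrams that also occur in the attended text."""
    gen = bigrams(generated)
    if not gen:
        return 0.0
    return len(gen & bigrams(attended)) / len(gen)


def word_score(
    log_prob: float,
    prefix: List[str],
    word: str,
    attended: List[str],
    gamma: float,
) -> float:
    """Assumed score: log p(y_t) + gamma * ref(prefix + [y_t], s*)."""
    return log_prob + gamma * ref(prefix + [word], attended)


attended = "the cat sat on the mat".split()
prefix = ["the", "cat"]
score_sat = word_score(math.log(0.5), prefix, "sat", attended, gamma=3.0)
score_ran = word_score(math.log(0.5), prefix, "ran", attended, gamma=3.0)
```

At equal model probability, the candidate that extends a bigram found in the attended sentence ("sat") scores higher than one that does not ("ran"), which is the effect the added term is meant to have.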
At the sentence level and the caption level, we set the sentence beam width as N, and keep N-best previously un-referred sentences or captions which have highest attending scores. For each sentence beam, we try M sentences or captions and keep the one achieving best word-level scores.

Image Selection and Alignment
We rank the images, select several most important images as the image summary, and align each sentence with an image in the image summary. The score of images is computed by equation (29).
where $\alpha_{i,j}$ is the attention score of the $j$-th image when generating the $i$-th sentence of the text summary, and |TextSum| is the number of summary sentences. The images are ranked by the scores in descending order, and the top K images are selected to form the image summary ImgSum. We align each sentence $i$ in TextSum to the image $j$ in ImgSum for which $\alpha_{i,j}$ is the largest.
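The ranking and alignment procedure can be sketched as follows. Equation (29) is not reproduced in this text, so the image score is assumed here to be the sum of an image's attention scores over all summary sentences; dividing by |TextSum| instead would not change the ranking. The function name and the sample scores are illustrative.

```python
from typing import List, Tuple


def select_and_align(
    alpha: List[List[float]], top_k: int
) -> Tuple[List[int], List[int]]:
    """alpha[i][j] is the attention on image j while generating sentence i.

    Returns (ImgSum as image indices, aligned image index per sentence)."""
    n_images = len(alpha[0])
    # Assumed image score: attention accumulated over all summary sentences.
    scores = [sum(row[j] for row in alpha) for j in range(n_images)]
    ranked = sorted(range(n_images), key=lambda j: scores[j], reverse=True)
    img_sum = ranked[:top_k]
    # Align each sentence with the selected image it attends most.
    alignment = [max(img_sum, key=lambda j: row[j]) for row in alpha]
    return img_sum, alignment


alpha = [
    [0.10, 0.30, 0.05, 0.02],
    [0.05, 0.25, 0.20, 0.01],
    [0.02, 0.04, 0.30, 0.03],
]
img_sum, alignment = select_and_align(alpha, top_k=2)
```

Each row sums to less than 1 in this example because part of the attention mass is assumed to fall on the original sentences.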

Data preparation
We extend the standard DailyMail corpora by extracting the images and the captions from the html-formatted documents. We call the extended corpora E-DailyMail. The standard DailyMail and CNN datasets are two widely used datasets for neural document summarization, which were originally built in (Hermann et al., 2015) by collecting human generated highlights and news stories from the news websites. We only extend the DailyMail dataset because it has more images and is easier to collect than the CNN dataset. We find that the text documents provided by the original DailyMail corpora contain captions. This is because all related texts were extracted from the html-formatted news when the corpora were created. We keep the original text documents unchanged in E-DailyMail. The split and statistics of E-DailyMail are shown in Table 1.

Implementation
We preprocess the text of the E-DailyMail corpora by tokenizing the text and replacing the digits with the <NUM> token. The 40k most frequent words in the corpora are kept and other words are replaced with OOV.
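A minimal sketch of this preprocessing is shown below. The regular-expression tokenizer, the function names, and the rule that a whole numeric token becomes `<NUM>` are assumptions; the paper does not specify the tokenizer it used.

```python
import re
from collections import Counter
from typing import Iterable, List, Set

NUM, OOV = "<NUM>", "OOV"
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
_TOKEN = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into tokens and replace numeric tokens with <NUM>."""
    tokens = _TOKEN.findall(text)
    return [NUM if _NUMBER.fullmatch(tok) else tok for tok in tokens]


def build_vocab(docs: Iterable[List[str]], size: int = 40000) -> Set[str]:
    """Keep the `size` most frequent tokens across all documents."""
    counts = Counter(tok for doc in docs for tok in doc)
    return {tok for tok, _ in counts.most_common(size)}


def replace_oov(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Replace every token outside the vocabulary with the OOV symbol."""
    return [tok if tok in vocab else OOV for tok in tokens]


tokens = tokenize("Shares rose 3.5 percent in 2017.")
vocab = build_vocab([["a", "a", "a", "b", "b", "c"]], size=2)
encoded = replace_oov(["a", "c", "b"], vocab)
```

The default `size=40000` mirrors the 40k vocabulary mentioned above; the tiny vocabulary in the usage lines is only for demonstration.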
Our model is implemented using Google's open-source seq2seq-master project written with TensorFlow. We use one layer of the GRU cell. The dimension of the hidden state of the RNN decoder is 512. The dimension of the word embedding vector is 128. The dimension of the hidden state of the bi-directional RNN encoder is 256. We initialize the word embeddings with Google's word2vec tools (Mikolov et al., 2013) trained on the whole text of the DailyMail/CNN corpora. We extract the 4096-dimension fully-connected layer of the 19-layer VGGNet (Simonyan and Zisserman, 2014) as the vector representation of images. We set the parameters of Adam to those provided in (Kingma and Ba, 2014). The batch size is set to 5. Convergence is reached within 800k training steps. Training 40k ~ 50k steps takes about one day on a GTX-1080 TI GPU card, depending on the model. The sentence beam width and the word beam width are set to 2 and 5 respectively. M is set to 3. The parameter γ is set to 3 or 300, tuned on the validation set.
To train a multi-modal attention mechanism such as attTIC, we concatenate the matrices of text representations, image representations, and caption representations into one matrix M.

Evaluation of Text Summarization
The widely used ROUGE (Lin, 2004) is adopted to evaluate text summaries.
HNNattT is similar to the model introduced in (Tan et al., 2017) without the graph-based attention. We compare our models with HNNattT to show the influence of multi-modal attention. The first 4 rows in Table 2 are the results with a summary length of 75 bytes. The results show that HNNattTI has a considerable improvement over HNNattT. An interesting observation is that HNNattTC and HNNattTIC are not better than HNNattT. One reason is that the text documents provided by the DailyMail corpora contain captions, so captions are already part of the text documents. The other reason is that captions distract the attention and cannot attract sufficient attention from the original sentences, which will be discussed in the next subsection.
We compare our methods with state-of-the-art neural summarization methods reported in recent papers on the DailyMail corpora. Extractive models include Lead which is a strong baseline using the leading 3 sentences as the summary, NN-SE (Cheng and Lapata, 2016), and SummaRuNNer-abs (Nallapati et al., 2017) which is trained on the abstractive summaries. Abstractive models include NN-ABS, NN-WE, LREG, though they are tested on 500 samples of the test set. LREG is a feature-based method using linear regression. NN-ABS is a simple hierarchical extension of (Rush et al., 2015). NN-WE is the abstractive model restricting the generation of words from the original document. The results are shown in the last 6 rows in Table 2. Our method HNNattTI outperforms the three extractive models and the three abstractive models.
We compare our models under the full-length F1 metric by setting the γ value to 300. According to (Tan et al., 2017), a large γ makes the generated summary have more overlap with the attended texts, and thus partly overcomes the repeated sentences problem in the generated summary. We do not incorporate the attention distraction mechanism into our model, because we want to focus on our own model to see whether considering images improves text summarization. Results in Table 3 also show that HNNattTI performs better than HNNattT, HNNattTC, and HNNattTIC.
To show the influence of our OOV replacement mechanism, we eliminate the mechanism from our models, and show the evaluation results in Table 4 and Table 5. We can see from the two tables that the scores are lower than the corresponding scores in Table 2 and Table 3. Our OOV replacement mechanism improves the summarization models, though the mechanism is relatively simple.
In short, combining and attending images in the neural summarization model improves document summarization.

Evaluation of Image Summarization
To evaluate the image summarization, the gold standard image summary is generated by a greedy algorithm on the captions: at each step $i$, the image $img_k$ that maximizes a score computed on the captions is chosen. We use the 1-image and 2-image randomly selected image summaries as the baselines against which we compare our models. The top 1 or 2 images ranked by our model are selected to form the summaries. Results in Table 4 show that HNNattTI outperforms the random baseline, while HNNattTC and HNNattTIC perform worse. This implies that attending images generates better sentence-image alignment in the multi-modal summaries than attending captions does. It also partly explains why our summarization model attending images when decoding generates better text summaries than the one attending captions. Figure 4 shows the text-image summary of the example demonstrated in Figure 1, generated by the HNNattTI model. In the summary, there are 2 images and 3 generated sentences, and each sentence is aligned with an image. The image summary has one image in common with Figure 2.

Instance
The sentences are named S1, S2, and S3 respectively. Table 7 shows the sentence-image alignment scores. The four images in the original document are numbered IMG1, IMG2, IMG3, and IMG4 from top to bottom and left to right. The sum of the alignment scores for a summary sentence is less than 1, because the sentence is also aligned with the sentences in the original document.

Conclusions
This paper proposes the text-image summarization task to summarize and align texts and images simultaneously. Most previous research summarizes texts and images separately, and little has been done on text-image summarization. We propose the multi-modal attentional mechanism which attends original sentences, images, and captions simultaneously in the hierarchical encoder-decoder model, use the RNN model to encode the ordered image set as the initial state of the decoder, and propose the multi-modal beam search algorithm which scores beams using the bigram overlaps of the generated sentences and the captions. The model is trained by using abstractive text summaries as the targets, and the attention scores of images are used to score images. The original DailyMail dataset is extended by collecting images and captions from the Web. Experiments show that our model attending images outperforms the models not attending images, three existing neural abstractive models and three existing extractive models. Experiments also show that our model can generate informative summaries of images.

Table 7: The sentence-image alignment scores of the generated summary for the news in Figure 1.