Multimodal Topic Labelling

Topics generated by topic models are typically presented as a list of topic terms. Automatic topic labelling is the task of generating a succinct label that summarises the theme or subject of a topic, with the intention of reducing the cognitive load of end users when interpreting these topics. Traditionally, topic labelling systems have focused on a single label modality, e.g. textual labels. In this work we propose a multimodal approach to topic labelling using a simple feedforward neural network. Given a topic and a candidate image or textual label, our method automatically generates a rating for the label relative to the topic. Experiments show that this multimodal approach outperforms single-modality topic labelling systems.


Introduction
LDA-style topic models (Blei et al., 2003) are a popular approach to document clustering, with the "topics" (in the form of multinomial distributions over words) and topic allocations per document (in the form of a multinomial distribution over the topics) providing a powerful document collection visualisation, gisting and navigational aid (Griffiths et al., 2007; Newman et al., 2010a; Chaney and Blei, 2012; Sievert and Shirley, 2014; Poursabzi-Sangdeh et al., 2016).
Given its internal structure, an obvious way of presenting a topic t is as a ranked list of its highest-probability terms w_i based on Pr(w_i|t), often simply based on a fixed "cardinality" (i.e. number of topic words) such as 10. However, this has a number of disadvantages: (a) there is a cognitive load in forming an impression of what concept the topic represents from its topic words (Aletras et al., 2014; Aletras et al., 2017); (b) there is a potential bias in presenting the topic based on a fixed cardinality; and (c) it can be hard to interpret mixed or incoherent topics (Newman et al., 2010b). Automatic topic labelling methods have been proposed to assist with topic interpretation, e.g. based on text (Lau et al., 2011; Bhatia et al., 2016) or images (Aletras and Stevenson, 2013; Aletras and Mittal, 2017), with recent work showing that the optimal modality (i.e. text or image) for topic labelling varies across topics (Aletras and Mittal, 2017).
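The standard ranked-list presentation described above can be sketched in a few lines of Python; the toy topic distribution here is invented purely for illustration:

```python
# Present a topic as its top-N terms, ranked by Pr(w|t).
def top_terms(topic_dist, n=10):
    """topic_dist: dict mapping word -> Pr(word|topic)."""
    return [w for w, _ in sorted(topic_dist.items(),
                                 key=lambda kv: kv[1], reverse=True)[:n]]

# Toy topic distribution (probabilities are illustrative only).
topic = {"food": 0.12, "eat": 0.10, "cook": 0.08, "chicken": 0.06,
         "recipe": 0.05, "cup": 0.04, "cheese": 0.03, "the": 0.01}

print(top_terms(topic, n=5))  # highest-probability words first
```

Varying `n` here is exactly the "cardinality" choice discussed above, and illustrates why a fixed cut-off can bias the impression a reader forms of the topic.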
The focus of this paper is the automatic rating of a textual or image label for a given topic. Our contributions are as follows: (1) we develop and release a novel topic labelling dataset with manually-scored image and textual labels for a diverse set of topics; one particular point of divergence from other text-image datasets is that text and image labels are rated on a common scale, and the optimal modality (text vs. image) for a given topic must be selected as part of the output; and (2) we propose two deep learning approaches to automatically rate multimodal topic label candidates, which we show to outperform single-modality topic labelling benchmarks. The code and dataset associated with this paper are available at: https://github.com/sorodoc/multimodal_topic_label.

Related work
Topic labelling methods usually involve two main steps: (1) the generation of candidate labels (e.g. text or images) for a given topic; and (2) the ranking of candidate labels by relevance to the topic. Textual labels have been sourced in a number of different ways, including noun chunks from a reference corpus (Mei et al., 2007), Wikipedia article titles (Lau et al., 2011; Bhatia et al., 2016), or short text summaries (Cano Basave et al., 2014; Wan and Wang, 2016). Images are often selected from Wikipedia or the web based on querying with topic words (Aletras and Stevenson, 2013; Aletras and Mittal, 2017). Recent work on topic labelling has shown that text or image embeddings can improve candidate label generation and ranking (Bhatia et al., 2016; Aletras and Mittal, 2017). Bhatia et al. (2016) use word2vec (Mikolov et al., 2013) and doc2vec (Le and Mikolov, 2014) to represent topics and candidate textual labels in the same latent semantic space. The most relevant textual labels for a topic are selected from Wikipedia article titles using the cosine similarity between the topic and article title embeddings. Finally, top labels are re-ranked in a supervised fashion using various features such as the PageRank score of the article in Wikipedia (Brin and Page, 1998), letter trigram ranking (Kou et al., 2015), topic word overlap, and the word length of the label. Aletras and Mittal (2017) use pre-computed dependency-based word embeddings (Levy and Goldberg, 2014) to represent the topics and the captions of the images, as well as image embeddings from the output layer of VGG-net (Simonyan and Zisserman, 2014) pretrained on ImageNet (Deng et al., 2009). A concatenation of these three vectors is the input to a simple deep neural network with four hidden layers and a sigmoid output layer that predicts the relevance score.
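The cosine-based candidate ranking step used in this line of work can be sketched as follows; the vectors are toy stand-ins for real doc2vec/word2vec embeddings, and the label names are invented:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_labels(topic_vec, candidates):
    """candidates: dict mapping label -> embedding.
    Returns labels sorted by cosine similarity to the topic, best first."""
    return sorted(candidates,
                  key=lambda c: cosine(topic_vec, candidates[c]),
                  reverse=True)

topic_vec = [0.9, 0.1, 0.0]                 # toy topic embedding
candidates = {"Cooking":  [0.8, 0.2, 0.1],  # toy article-title embeddings
              "Hardware": [0.1, 0.1, 0.9]}
print(rank_labels(topic_vec, candidates))   # most relevant label first
```

In the actual systems the candidate set comes from Wikipedia article titles (or retrieved images), and this unsupervised ranking is followed by a supervised re-ranking step.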
Textual or visual modalities for labelling topics have been studied extensively, although independently from one another. Our work differs from the single-modality methods described above in that it uses a joint model to predict the continuous-valued rating for both textual and image labels. This is, to the best of our knowledge, the first attempt at joint multimodal topic labelling.

Dataset
Several annotated datasets have been developed in previous work for topic labelling, although each is based on a single label modality (i.e. text or images). For example, Aletras and Stevenson (2013) used topics generated from New York Times articles and collected image labels with human ratings, while Bhatia et al. (2016) collected rated textual labels for topics drawn from several domains. The topics of these different datasets do not overlap, and as such have little utility for our multimodal method. To this end, we develop a new dataset which contains human-assigned ratings for two topic label modalities (textual and image) for the same set of topics. We build on the dataset of Bhatia et al. (2016), which has ratings for textual labels. This dataset contains 228 topics generated from 4 different domains: BLOGS, BOOKS, NEWS and PUBMED. Each topic has 19 textual labels which were rated by human judges on a scale of 0-3, where 0 represents a poor label and 3 indicates a perfect label. We chose this dataset due to the diversity of sources represented in the topics.

Table 1: Example of a topic and its textual and image labels.
We use the 228 topics and generate image labels for each topic following the method of Aletras and Stevenson (2013). We follow the annotation approach of Bhatia et al. (2016), collecting ratings on an ordinal scale of 0-3. We use Amazon Mechanical Turk to crowdsource the ratings, and have each image labelled by 8 workers. To aggregate the ratings for a label, we compute its mean rating.
For quality control, we embedded a bad label into the HIT for each topic by sampling a label candidate from a topic of a different domain, under the assumption that an out-of-domain label is highly unlikely to be appropriate. Workers who rate these control labels higher than 1 are recorded as failing the control, and those who fail more than 50% of control labels are filtered out of the dataset.
In total, 353 turkers participated in the image labelling task, with an average error rate of 16% (based on the control images). A total of 42 turkers were filtered out, on the basis of having an error rate of more than 50%.
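The aggregation and quality-control logic above can be expressed as a minimal sketch; the worker IDs, ratings, and 50% threshold parameterisation below are illustrative:

```python
def mean_rating(ratings):
    # Aggregate a label's ratings by taking their mean.
    return sum(ratings) / len(ratings)

def filter_workers(control_ratings, max_fail_rate=0.5):
    """control_ratings: dict mapping worker -> ratings they gave to the
    embedded out-of-domain control labels. Rating a control label higher
    than 1 counts as a failure; workers failing more than max_fail_rate
    of their controls are dropped."""
    kept = {}
    for worker, ratings in control_ratings.items():
        failures = sum(1 for r in ratings if r > 1)
        if failures / len(ratings) <= max_fail_rate:
            kept[worker] = ratings
    return kept

# Toy data: w2 fails 3 of 4 controls and is filtered out.
controls = {"w1": [0, 1, 0, 0], "w2": [3, 2, 2, 0]}
print(sorted(filter_workers(controls)))
```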
An example of a topic and its image and textual labels, with their associated mean ratings, is presented in Table 1. The mean rating for the textual labels is 1.57 with a variance of 0.29, while the mean rating for the image labels is 1.84 with a variance of 0.51. That is, the image labels are, on average, of better quality, but there is also more variability among them.
To summarise, our final dataset consists of 4560 images and 4332 textual labels for 228 topics (20 images and 19 textual labels for each topic). To the best of our knowledge, it is the first dataset which has ratings for two topic label modalities. In addition to benefiting topic labelling research, it has potential applications in other language and vision tasks such as image captioning.

Models
Our baseline model (baseline) combines the two methodologies of Aletras and Mittal (2017) and Bhatia et al. (2016). That is, we generate and rank textual and image labels based on Bhatia et al. (2016) and Aletras and Mittal (2017) respectively, and then generate a combined ranking based on the predicted ratings. (Bhatia et al. (2016) originally used SVR to rank textual labels; we re-ran their model using the same features and SVR to predict label ratings, allowing us to combine both textual and image labels and rank them by their predicted ratings.) The baseline model views the two modalities (image and textual labelling) as two distinct tasks and does not leverage potential complementarity between them. We propose a simple feed-forward neural network that jointly re-ranks the two topic label modalities (joint-NN). In joint-NN, we first generate the candidate image labels and textual labels using the methodologies of Aletras and Mittal (2017) and Bhatia et al. (2016), respectively. However, unlike baseline, where the labels are ranked separately, joint-NN feeds both label modalities into a single network to predict their ratings. The network architecture is depicted in Figure 1.
Each input modality is fed into two dense layers that are unconnected. The hidden representation at the 4th layer of the networks is then passed to a joint/shared hidden layer before the final output layer. All connections between layers are dense connections and the final output layer has a sigmoid activation, while all other hidden layers have ReLU activations. The first four layers are kept separate to allow the network to transform the embeddings from the two different modalities to a common hidden representation. The shared layers leverage potential complementarity between the two label modalities to predict the final label rating.
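A rough numpy sketch of this forward pass is given below. The tower depth, layer widths, and weight initialisation here are illustrative (the paper specifies only some of the exact dimensions), so this is a shape-level sketch of the architecture rather than the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, act="relu"):
    # One fully-connected layer with a ReLU or sigmoid activation.
    h = x @ w + b
    if act == "relu":
        return np.maximum(h, 0.0)
    return 1.0 / (1.0 + np.exp(-h))  # sigmoid

# Illustrative sizes: textual and image inputs differ in dimension.
d_txt, d_img, d_hid = 1200, 1300, 256

shapes = {"t1": (d_txt, d_hid), "t2": (d_hid, d_hid),   # text tower
          "i1": (d_img, d_hid), "i2": (d_hid, d_hid),   # image tower
          "s1": (d_hid, d_hid), "out": (d_hid, 1)}      # shared layers
W = {k: rng.normal(0, 0.01, size=s) for k, s in shapes.items()}
b = {k: np.zeros(s[1]) for k, s in shapes.items()}

def rate(x, modality):
    # Modality-specific tower maps the input to a common representation,
    # then shared layers predict the label rating.
    l1, l2 = ("t1", "t2") if modality == "text" else ("i1", "i2")
    h = dense(dense(x, W[l1], b[l1]), W[l2], b[l2])
    h = dense(h, W["s1"], b["s1"])
    return dense(h, W["out"], b["out"], act="sigmoid")  # rating in (0, 1)

score = rate(rng.normal(size=d_txt), "text")
print(score.item())  # an (untrained) predicted rating
```

The key design point is that both label modalities share the final layers, which is what allows the network to exploit complementarity between them.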
We generate the textual labels following the label generation methodology of Bhatia et al. (2016), as part of which the labels and topic terms each have representations based on doc2vec and word2vec embeddings. We concatenate all four embeddings and use them as the input to the network. (Each type of embedding has 300 dimensions; the concatenated input thus has 4 × 300 = 1200 dimensions.) Bhatia et al. (2016) found that letter trigram features and PageRank features were strong features when re-ranking the labels. We borrow this idea, and incorporate these two features into the network by mapping the 2-dimensional input (representing the letter trigram and PageRank features) into a 128-dimension vector and concatenating it with the 256-dimension hidden representation at the third layer (thus yielding a 384-dimension vector).

For the visual labels, the topic terms use the same doc2vec and word2vec embeddings. For the image labels, we use the representation of the last layer of the VGG neural network (Simonyan and Zisserman, 2014). As before, the vectors for the topic terms and image labels are concatenated and fed as input to the network.

As a control to test whether the sharing of weights helps with the prediction of label ratings, we experiment with another network (disjoint-NN) that has the same architecture as joint-NN, except that the final few layers are not shared and the two networks are trained independently.

Table 3: Example of two topics and their generated textual and image labels and predicted ratings (topic terms: "food, eat, cook, chicken, recipe, cup, cheese, add, taste, tomato" and "drive, computer, card, laptop, memory, battery, usb, intel, processor").
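The construction of the textual-label input can be made concrete as follows; the function names are invented, and the embeddings would in practice come from trained doc2vec/word2vec models:

```python
import numpy as np

EMB = 300  # each doc2vec / word2vec embedding is 300-dimensional

def textual_input(topic_d2v, topic_w2v, label_d2v, label_w2v):
    """Concatenate the four 300-d embeddings into a 1200-d network input."""
    x = np.concatenate([topic_d2v, topic_w2v, label_d2v, label_w2v])
    assert x.shape == (4 * EMB,)
    return x

def auxiliary_features(letter_trigram_score, pagerank_score):
    """The 2-d feature vector that is projected to 128 dimensions and
    concatenated with the 256-d third-layer hidden representation."""
    return np.array([letter_trigram_score, pagerank_score])

# Zero vectors stand in for real embeddings here.
x = textual_input(*[np.zeros(EMB)] * 4)
print(x.shape)  # (1200,)
```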

Experiments and results
Following standard practice in topic labelling evaluation (Lau et al., 2011; Aletras and Stevenson, 2013; Bhatia et al., 2016), we use "top-1 average rating" as the evaluation metric. It computes the mean rating of the top-ranked label generated by the system, and provides an assessment of the absolute utility of the labels. For example, if the top-ranked label predicted by the system has an average rating of 3.0, the system is generating perfect topic labels. We present the results of all systems (baseline, joint-NN and disjoint-NN) in Table 2. Each model is trained using 10-fold cross-validation for 10 epochs; presented results are an average over 20 runs. We display three types of evaluation: (1) "multimodal", where we pool both label modalities together and evaluate jointly; (2) "visual-only", where we evaluate only the visual labels; and (3) "textual-only", where we evaluate only the textual labels. In addition to the 3 systems, for each topic we determine the rating of the best label and compute its mean over all topics, as the upper bound for the task (labelled "upper bound").
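The metric itself is straightforward to implement; the per-label scores and ratings below are toy values:

```python
def top1_average_rating(topics):
    """topics: one list per topic of (system_score, gold_mean_rating)
    pairs. For each topic, take the gold rating of the label the system
    ranked highest, then average over topics."""
    top1 = [max(labels, key=lambda p: p[0])[1] for labels in topics]
    return sum(top1) / len(top1)

# Two toy topics with (system score, human mean rating) per label.
topics = [[(0.9, 2.5), (0.4, 1.0)],
          [(0.2, 0.5), (0.8, 1.5)]]
print(top1_average_rating(topics))  # (2.5 + 1.5) / 2 = 2.0
```

The upper bound reported in Table 2 corresponds to replacing the system score with the gold rating itself, i.e. always picking the best-rated label per topic.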
Encouragingly, joint-NN, which exploits information from both input modalities, achieves the best performance. The improvement over disjoint-NN is substantial, and much of it comes from the textual labels. However, when compared to baseline, most of the gain is in the visual labels. These observations seem a little unintuitive; to better understand them we first look at baseline and disjoint-NN.
In terms of methodology, the difference between baseline and disjoint-NN is their re-rankers. The image label re-rankers of both baseline and disjoint-NN are driven by neural networks, but the re-ranker of disjoint-NN has an additional layer (5 vs. 4). The improvement in results for the visual labels could thus be attributed to the additional hidden layer.
On the other hand, the performance difference for the textual labels between baseline and disjoint-NN can be attributed to the classifiers (baseline = SVR; disjoint-NN = neural network), since they both share the same features. These results suggest that SVR is the superior classifier in this case. (Note that the candidate pools of the individual label modalities and the combined labels differ, meaning that nDCG numbers are not directly comparable across the evaluation settings.)
However, when we share the latent representations for the last few layers (joint-NN), we see that results improve substantially. In particular, textual label performance is on par with baseline, suggesting that the addition of image label data helps learn the latent representations of textual labels. As a whole, this suggests there is strong complementarity between the two label modalities, and highlights the strength of a multimodal network.
Lastly, it is worth mentioning that the multimodal evaluation yields the highest rating across all systems. This suggests that, consistent with the findings of Aletras and Mittal (2017), different topics may have different optimal label modalities (image or textual), and that the best performance is achieved when we allow the model to dynamically select between modalities. We present a sample of generated textual and image labels in Table 3.
Looking at the upper bound, we see there is considerable room for further improvement. The models we have experimented with are based on simple feed-forward architectures, and the input representations are pre-computed, and thus not updated in the network. An immediate direction for future work would be to design end-to-end architectures that take raw features as input (e.g. using the image pixels for the image labels).

Conclusions
In this paper, we have proposed a multimodal approach to automatic topic labelling, based on a deep neural network. Compared to benchmark systems, our joint model achieves the best performance, demonstrating the strength of modelling different label modalities jointly.
Another contribution of the paper is the development of a multimodal dataset which we have released publicly. The dataset, which contains annotations for image and textual labels, could have applications for other multimodal NLP tasks.