I Can Has Cheezburger? A Nonparanormal Approach to Combining Textual and Visual Information for Predicting and Generating Popular Meme Descriptions

The advent of social media has brought Internet memes, a unique social phenomenon, to the front stage of the Web. Although memes are embodied as images with text descriptions, we know little about the “language of memes”. In this paper, we statistically study the correlations among popular memes and their wordings, and generate meme descriptions from raw images. To do this, we take a multimodal approach: we propose a robust nonparanormal model to learn the stochastic dependencies among the image, the candidate descriptions, and the popular votes. In experiments, we show that combining text and vision helps identify popular meme descriptions; that our nonparanormal model is able to learn dense and continuous vision features jointly with sparse and discrete text features in a principled manner, outperforming various competitive baselines; and that our system can generate meme descriptions using a simple pipeline.


Introduction
In the past few years, Internet memes have become a new, contagious social phenomenon: it all starts with an image paired with a witty, catchy, or sarcastic sentence, which people circulate from friend to friend, colleague to colleague, and family to family. Eventually, some of these memes go viral on the Internet.
A meme is not only about the funny picture, the Internet culture, or the emotion that it passes along, but also about the richness and uniqueness of its language: it is often highly structured with a special written style, and forms interesting and subtle connotations that resonate among readers. For example, the LOL cat memes (e.g., Figure 1) often include superimposed text with broken grammar and/or spelling.
Even though memes are popular on the Internet, the "language of memes" is still not well understood: there are no systematic studies on predicting and generating popular Internet memes from the Natural Language Processing (NLP) and Computer Vision (CV) perspectives.
In this paper, we take a multimodal approach to predict and generate popular meme descriptions. To do this, we collect a set of original meme images, a list of candidate descriptions, and the corresponding votes. We propose a robust nonparanormal approach (Liu et al., 2009) to model the multimodal stochastic dependencies among images, text, and votes. We then introduce a simple pipeline for generating meme descriptions that combines reverse image search with traditional information retrieval approaches. In empirical experiments, we show that our model outperforms strong discriminative baselines by very large margins in the regression/ranking experiments. In the generation experiment, the nonparanormal outperforms the second-best supervised baseline by 4.35 BLEU points, and obtains a BLEU score improvement of 4.48 over an unsupervised recurrent neural network language model trained on a meme corpus that is almost 90 times larger. Our contributions are three-fold:
• We are the first to study the "language of memes" combining NLP, CV, and machine learning techniques, and show that combining the visual and textual signals helps identify popular meme descriptions;
• Our approach empowers Internet users to select better wordings and generate new memes automatically;
• Our proposed robust nonparanormal model outperforms competitive baselines for predicting and generating popular meme descriptions.
In the next section, we outline related work. In Section 3, we introduce the theory of copula, and our nonparanormal approach. In Section 4, we describe the datasets. We show the prediction and generation results in Section 5 and Section 6. Finally, we conclude in Section 7.

Related Work
Although the language of Internet memes is a relatively new research topic, our work is broadly related to studies on predicting popular social media messages (Hong et al., 2011; Bakshy et al., 2011; Artzi et al., 2012). Most recently, Tan et al. (2014) study the effect of wording for Tweets. However, none of the above studies has investigated multimodal approaches that combine text and vision.
Recently, there has been growing interest in interdisciplinary research on generating image descriptions. Gupta et al. (2009) have studied the problem of constructing plots from video understanding. The work by Farhadi et al. (2010) is among the first to generate sentences from images. Kulkarni et al. (2011) use linguistic constraints and a conditional random field model for the task, whereas others leverage syntactic information and co-occurrence statistics, or use a large text corpus and CV algorithms for detecting visual text. With the surge of interest in deep learning techniques in NLP (Socher et al., 2013; Devlin et al., 2014) and CV (Oquab et al., 2013), there have been several unrefereed manuscripts on parsing images and generating text descriptions lately (Vinyals et al., 2014; Chen and Zitnick, 2014; Donahue et al., 2014; Fang et al., 2014; Karpathy and Fei-Fei, 2014) using neural network models. Although the above studies have shown interesting results, our task is arguably more complex than generating text descriptions: in addition to the visual and textual signals, we have to model the popular votes as a third dimension for learning. For example, we cannot simply train a convolutional neural network image parser on billions of images, and use recurrent neural networks to generate texts such as "There is a white cat sitting next to a laptop." for Figure 1. Additionally, since not all images are suitable as meme images, collecting training images is also more challenging in our task.
In contrast to prior work, we take a very different approach: we investigate copula methods (Schweizer and Sklar, 1983; Nelsen, 1999), in particular, the nonparanormals (Liu et al., 2009), for joint modeling of raw images, text descriptions, and popular votes. The copula is a framework from Statistics for analyzing random variables, and is often used in Economics (Chen and Fan, 2006). Only very recently have researchers from the machine learning and information retrieval communities (Ghahramani et al., 2012; Han et al., 2012; Eickhoff et al., 2013) started to understand the theory and the predictive power of copula models. Wang and Hua (2014) are the first to introduce the semiparametric Gaussian copula (a.k.a. the nonparanormal) for text prediction. However, their approach may be prone to overfitting. In this work, we generalize Wang and Hua's method to jointly model text and vision features with popular votes, while scaling up the model using effective dropout regularization.

Our Approach
A key challenge for joint modeling of text and vision is that, because textual features are often relatively sparse and discrete, while visual features are typically dense and continuous, it is difficult to model them jointly in a principled way.
To avoid comparing "apples and oranges" in the same probabilistic space, we propose the nonparanormal approach, which extends the Gaussian graphical model by transforming its variables with smooth functions. More specifically, for each dimension of the textual and visual features, instead of using raw counts or histograms, we first use the probability integral transform to generate empirical cumulative distribution functions (ECDFs): instead of the probability density function (PDF) space, we now work in the ECDF space, where the value of each feature is based on its rank and is strictly restricted between 0 and 1. Then, we use kernel density estimation to smooth out the zero-valued features. Finally, with textual and visual features now compatible, we build a parametric Gaussian copula model to estimate the pairwise correlations among the covariates and the dependent variable.
In this section, we first explain the visual and textual features used in this study. Then, we introduce the theory of copula, and describe the robust nonparanormal. Finally, we show a simple pipeline for generating meme descriptions.

Features
Textual Features
To model the meme descriptions, we take a broad range of textual features into consideration:
• Lexical Features: we extract unigrams and bigrams from meme descriptions as surface-level lexical features.
• Dependency Triples: to better understand the deeper syntactic dependencies of keywords in
• Named Entity Features: after browsing the dataset, we notice that certain names are often mentioned in memes (e.g. "Drake", "Kanye West", and "Justin Bieber"), so we utilize the Stanford named entity recognizer (Finkel et al., 2005) to extract lexicalized named entities.
• Frame-Semantics Features: SEMAFOR (Das et al., 2010) is a state-of-the-art frame-semantics parser that produces FrameNet-style semantic annotation. We use SEMAFOR to extract frame-level semantic features.

Visual Features
A key insight on viral memes is that the images producing a shared social signal are typically inter-related in style. For example, LOLcats are an early series of memes involving funny cat photos. Similarly, "Bieber memes" involve modified pictures of Bieber. Therefore, we hypothesize that, when extracting visual features, it is of crucial importance to capture the entities, objects, and styles as visual words in these inter-related meme images. The popular visual bag-of-words representation (Sivic and Zisserman, 2003) is used to describe images:
1. PHOW Features Extraction: PHOW (Bosch et al., 2007) is a dense and multi-scale variant of the Scale Invariant Feature Transform (SIFT) descriptors. Unlike textual feature extraction, SIFT first detects the Harris keypoints from an image, and then describes each keypoint with a vector. An example of the SIFT frames is shown in Figure 3. Using PHOW, we obtain about 20K keypoints for each image.
2. Elkan K-means Clustering: this is the clustering method (Elkan, 2003) that we use to obtain the vocabulary of visual words. Compared to other variants of K-means, this method quickly constructs the codebook from PHOW keypoints.
3. Bag-of-Words Histograms: these are used to represent each image. We match the PHOW keypoints of each image with the vocabulary that we extract in the previous step, and generate a 1 × 200 visual bag-of-words vector.
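The third step of the list above can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the implementation used in the experiments: the function names `nearest_word` and `bow_histogram` are hypothetical, the descriptors are toy 2-D points rather than 128-D SIFT vectors, and the codebook has 3 words rather than 200.

```python
import math

def nearest_word(descriptor, codebook):
    """Return the index of the codebook centroid closest to the descriptor."""
    best_index, best_dist = 0, math.inf
    for index, centroid in enumerate(codebook):
        dist = math.dist(descriptor, centroid)  # Euclidean distance
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bow_histogram(descriptors, codebook):
    """Count how many keypoint descriptors are assigned to each visual word."""
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[nearest_word(descriptor, codebook)] += 1
    return histogram

# Toy data (assumption): a 3-word codebook and five 2-D "descriptors".
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.1), (4.8, 5.1), (5.2, 4.9), (0.0, 0.1)]
print(bow_histogram(descriptors, codebook))  # [2, 1, 2]
```

In practice the codebook would come from the K-means step, and the resulting counts form one row of the image feature matrix.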

The Theory of Copula
In the Statistics literature, copulas are widely known as a family of distribution functions. The idea behind copula theory is that the cumulative distribution function (CDF) of a random vector can be represented in terms of uniform marginal cumulative distribution functions and a copula that connects these marginal CDFs and describes the correlations among the input random variables. However, in order to have a valid multivariate distribution function regardless of the $n$-dimensional covariates, not every function can be used as a copula function. The central idea behind copulas can therefore be summarized by Sklar's theorem and its corollary.
Theorem 1 (Sklar's Theorem (1959)) Let $F$ be the joint cumulative distribution function of $n$ random variables $X_1, X_2, \ldots, X_n$. Let the corresponding marginal cumulative distribution functions of the random variables be $F_1(x_1), F_2(x_2), \ldots, F_n(x_n)$. Then, if the marginal functions are continuous, there exists a unique copula $C$, such that
$$F(x_1, \ldots, x_n) = C(F_1(x_1), \ldots, F_n(x_n)).$$
Furthermore, if the distributions are continuous, the multivariate dependency structure and the marginals can be separated, and the copula can be considered independent of the marginals (Joe, 1997; Parsa and Klugman, 2011). Therefore, the copula places no requirements on the marginal distributions: arbitrary marginals can be combined, and their dependency structure can be modeled using the copula. The converse of Sklar's theorem also holds:
Corollary 1 If there exists a copula $C$ on $(0, 1)^n$ and marginal cumulative distribution functions $F_1(x_1), F_2(x_2), \ldots, F_n(x_n)$, then $C(F_1(x_1), \ldots, F_n(x_n))$ defines a multivariate cumulative distribution function.
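As a small worked illustration of Sklar's theorem (our own toy example, not part of the experiments), consider two independent random variables. Their copula is the product copula, so the joint CDF factorizes into the marginals; the exponential marginals and the rates $\lambda_1, \lambda_2$ below are assumptions chosen only for concreteness.

```latex
% Product (independence) copula applied to two marginals
C(u_1, u_2) = u_1 u_2
\quad\Longrightarrow\quad
F(x_1, x_2) = C\bigl(F_1(x_1), F_2(x_2)\bigr) = F_1(x_1)\, F_2(x_2).

% With exponential marginals F_i(x_i) = 1 - e^{-\lambda_i x_i}, x_i \ge 0:
F(x_1, x_2) = \bigl(1 - e^{-\lambda_1 x_1}\bigr)\bigl(1 - e^{-\lambda_2 x_2}\bigr).
```

Replacing the product copula with a copula that encodes dependence changes the joint distribution while leaving both marginals untouched, which is the separation that the theorem describes.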

The Nonparanormal
To model multivariate text and vision variables, we choose the nonparanormal (NPN) as the copula function in this study, which can be explained in the following two parts.

The Nonparametric Estimation
Assume we have $n$ random variables of vision and text features $X_1, X_2, \ldots, X_n$. The problem is that text features are sparse, so we need to perform nonparametric kernel density estimation to smooth out the distribution of each variable. Let $f_1, f_2, \ldots, f_n$ be the unknown densities; we are interested in deriving the shape of these functions. Assuming we have $m$ samples $x^{(1)}, \ldots, x^{(m)}$ of a variable, the kernel density estimator can be defined as:
$$\hat{f}_h(x) = \frac{1}{mh} \sum_{j=1}^{m} K\left(\frac{x - x^{(j)}}{h}\right)$$
Here, $K(\cdot)$ is the kernel function; in our case, we use the Box kernel:
$$K(z) = \begin{cases} \frac{1}{2}, & |z| \le 1, \\ 0, & |z| > 1. \end{cases}$$
Compared to the Gaussian kernel and other kernels, the Box kernel is simple and computationally inexpensive. The parameter $h$ is the bandwidth for smoothing. Now, we can derive the empirical cumulative distribution functions $\hat{F}$ of the smoothed covariates, as well as of the dependent variable $y$ (which is the reciprocal rank of the popular votes of a meme), with CDF $\hat{F}_y(\hat{f}(y))$. Given $m$ samples $x^{(1)}, \ldots, x^{(m)}$ of a variable, the empirical cumulative distribution function is defined as:
$$\hat{F}(\nu) = \frac{1}{m} \sum_{j=1}^{m} I\{x^{(j)} \le \nu\}$$
where $I\{\cdot\}$ is the indicator function, and $\nu$ indicates the current value that we are evaluating. Note that the above step is also known as the probability integral transform (Diebold et al., 1997), which allows us to convert any given continuous distribution to random variables having a uniform distribution. This is crucial for text: instead of using the raw counts, we are now working with uniform marginal CDFs, which helps cope with the overfitting issue due to noise and data sparsity. We also use the same procedure to transform the vision features into the CDF space to be compatible with the text features.
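The two estimators in this subsection can be sketched as follows. This is a minimal illustration, not the authors' code: we assume the standard box kernel (equal to 1/2 on [-1, 1]), the function names are hypothetical, and the feature counts are toy values. How the smoothed values feed into the ECDF in the full model is not specified here, so the two functions are shown independently.

```python
def box_kernel(z):
    """Box (uniform) kernel: 1/2 inside [-1, 1], 0 outside (assumed form)."""
    return 0.5 if abs(z) <= 1 else 0.0

def kde(x, samples, h):
    """Kernel density estimate at x from m samples with bandwidth h."""
    m = len(samples)
    return sum(box_kernel((x - s) / h) for s in samples) / (m * h)

def ecdf(value, samples):
    """Empirical CDF: fraction of samples less than or equal to value."""
    return sum(1 for s in samples if s <= value) / len(samples)

# Toy sparse feature (assumption): raw counts of one bigram in five memes.
counts = [0, 0, 0, 1, 3]
print(kde(0.0, counts, h=1.0))            # density estimate near zero
print([ecdf(v, counts) for v in counts])  # values mapped into (0, 1]
```

The ECDF output depends only on the ordering of the samples, which is why counts and histograms of very different scales become comparable after the transform.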

The Robust Estimation of Copula
Now that we have obtained the marginals, the joint distribution can be constructed by applying a copula function that models the stochastic dependencies among the marginal CDFs. Let $x_i$ be the smoothed version of random variable $X_i$, and let $y$ be the smoothed label. Following Sklar's theorem, we have:
$$F(x_1, \ldots, x_n, y) = C(F_{x_1}(x_1), \ldots, F_{x_n}(x_n), F_y(y)).$$
In this work, we apply the parametric Gaussian copula to model the correlations among the features and the label:
$$F(x_1, \ldots, x_n, y) = \Phi_\Sigma\left(\Phi^{-1}(F_{x_1}(x_1)), \ldots, \Phi^{-1}(F_{x_n}(x_n)), \Phi^{-1}(F_y(y))\right)$$
where $\Phi_\Sigma$ is the joint cumulative distribution function of a multivariate Gaussian with zero mean and covariance $\Sigma$, and $\Phi^{-1}$ is the inverse CDF of a standard Gaussian. In this parametric part of the model, parameter estimation boils down to the problem of learning the covariance matrix $\Sigma$ of this Gaussian copula. In this work, we perform standard maximum likelihood estimation (MLE) for the $\Sigma$ matrix, following the details from prior work (Wang and Hua, 2014).
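To make the role of $\Phi^{-1}$ concrete, the sketch below maps two variables to Gaussian scores through their ECDFs and then computes one off-diagonal entry of a correlation matrix. This is a simple rank-based stand-in that we use for illustration only; it is not the MLE procedure of Wang and Hua (2014). The function names, the toy data, and the rescaling by m + 1 (used to keep the ECDF strictly inside (0, 1)) are our assumptions.

```python
from statistics import NormalDist, mean

def normal_scores(values):
    """Map values to Gaussian scores via a rescaled ECDF and the inverse normal CDF."""
    m = len(values)
    std_normal = NormalDist()
    # rank / (m + 1) stays strictly inside (0, 1), so inv_cdf is always finite
    return [
        std_normal.inv_cdf(sum(1 for w in values if w <= v) / (m + 1))
        for v in values
    ]

def correlation(a, b):
    """Pearson correlation between two equally long sequences."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Toy data (assumption): one covariate and a label that is monotone in it.
feature = [0.1, 0.4, 0.35, 0.8, 0.9]
votes = [1, 3, 2, 10, 50]
rho = correlation(normal_scores(feature), normal_scores(votes))
print(rho)
```

Because the label is a monotone function of the covariate in this toy case, both variables receive identical Gaussian scores, even though their raw scales differ widely.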
To avoid overfitting, one traditionally resorts to classic regularization techniques such as Lasso (Tibshirani, 1996). While Lasso is widely used, the non-differentiable nature of the L1 norm often makes the objective function difficult to optimize. In this work, we propose dropout training as copula regularization. Dropout was proposed by Hinton et al. as a method to prevent feature co-adaptation in the deep learning framework, but recent studies (Wager et al., 2013) also show that its behaviour is similar to L2 regularization, and that it can be approximated efficiently (Wang and Manning, 2013) in many other machine learning tasks. Another advantage of dropout training is that, unlike Lasso, it does not require all the features for training, and training is "embarrassingly" parallelizable.
In the Gaussian copula estimation context, we can introduce another dimension $\ell$, the number of dropout learners, to extend $\Sigma$ into a dropout tensor. Essentially, the task becomes the estimation of $\Sigma_1, \Sigma_2, \ldots, \Sigma_\ell$, where the input feature space for each dropout component is randomly corrupted by $(1 - \delta)$ percent of the original dimension. At inference time, we use the geometric mean to average the predictions from each dropout learner, and generate the final prediction. Note that the final $\Sigma$ matrix has to be symmetric and positive definite, so we apply tiny random Gaussian noise to maintain this property.
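The ensemble logic described above can be sketched as follows, under stated assumptions: each dropout learner sees a random subset of the feature indices, and the per-learner predictions are combined with a geometric mean. The helper names, the seed, and the reading of the dropout setting as "keep 80% of the features" are ours and may not match the authors' exact definition of δ; the per-learner predictions are placeholders rather than outputs of a trained copula.

```python
import math
import random

def dropout_masks(num_features, num_learners, keep_fraction, seed=0):
    """Draw one random feature subset (sorted indices) per dropout learner."""
    rng = random.Random(seed)
    keep = max(1, round(num_features * keep_fraction))
    return [
        sorted(rng.sample(range(num_features), keep))
        for _ in range(num_learners)
    ]

def geometric_mean(predictions):
    """Geometric mean of strictly positive predictions."""
    return math.exp(sum(math.log(p) for p in predictions) / len(predictions))

masks = dropout_masks(num_features=10, num_learners=4, keep_fraction=0.8)

# Placeholder predictions (assumption), e.g. predicted reciprocal ranks,
# one per dropout learner, each in (0, 1].
predictions = [0.5, 0.25, 0.5, 0.25]
print(geometric_mean(predictions))
```

Since each learner only touches its own feature subset, the learners can be trained independently, which is what makes the scheme easy to parallelize.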

Computational Complexity
One important question regarding the proposed nonparanormal model is the corresponding computational complexity. This boils down to the estimation of the $\Sigma$ matrix: one only needs to calculate the correlation coefficients of $n(n - 1)/2$ pairs of random variables. Christensen (2005) shows that sorting and balanced binary trees can be used to calculate the correlation coefficients with complexity of $O(n \log n)$. Therefore, the computational complexity of MLE for the proposed model is $O(n \log n)$.

Efficient Approximate Inference
In this prediction task, in order to perform exact inference of the conditional probability distribution $p(F_y(y) \mid F_{x_1}(x_1), \ldots, F_{x_n}(x_n))$, one needs to solve for the mean response $\hat{E}(F_y(y) \mid F_{x_1}(x_1), \ldots, F_{x_n}(x_n))$ from the joint distribution of a high-dimensional Gaussian copula. Unfortunately, exact inference can be intractable in the multivariate case, and approximate inference, such as Markov Chain Monte Carlo sampling (Gelfand and Smith, 1990; Pitt et al., 2006), is often used for posterior inference. In this work, we propose an efficient sampling method to derive $y$ given the text features: we sample $\hat{F}_y(y)$ such that it maximizes the joint high-dimensional Gaussian copula density:
$$\hat{F}_y(y) \approx \arg\max_{F_y(y) \in (0, 1)} \frac{1}{\sqrt{\det \Sigma}} \exp\left(-\frac{1}{2} \Delta^{T} \left(\Sigma^{-1} - I\right) \Delta\right)$$
where $\Delta = \left(\Phi^{-1}(F_{x_1}(x_1)), \ldots, \Phi^{-1}(F_{x_n}(x_n)), \Phi^{-1}(F_y(y))\right)^{T}$. This approximate inference scheme using maximum density sampling from the Gaussian copula significantly relaxes the complexity of inference. Finally, to derive $\hat{y}$, the last step is to compute the inverse CDF of $\hat{F}_y(y)$. A detailed description of the inference algorithm can be found in our prior work (Wang and Hua, 2014).
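The maximum-density idea can be illustrated in the simplest possible setting: one covariate, one label, and a known correlation. The sketch below is our own bivariate toy version, not the high-dimensional algorithm of Wang and Hua (2014); it uses the standard closed form of the bivariate Gaussian copula density, and the function names, grid size, and the values of `u_x` and `rho` are assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def copula_log_density(u_x, u_y, rho):
    """Log-density of a bivariate Gaussian copula with correlation rho."""
    z_x, z_y = STD_NORMAL.inv_cdf(u_x), STD_NORMAL.inv_cdf(u_y)
    quad = (rho**2 * z_x**2 - 2 * rho * z_x * z_y + rho**2 * z_y**2) / (1 - rho**2)
    return -0.5 * math.log(1 - rho**2) - 0.5 * quad

def predict_label_cdf(u_x, rho, grid_size=999):
    """Grid search for the label CDF value that maximizes the copula density."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return max(grid, key=lambda u_y: copula_log_density(u_x, u_y, rho))

# Toy inputs (assumption): the covariate sits at its 70th percentile and the
# copula correlation between covariate and label is 0.8.
u_hat = predict_label_cdf(u_x=0.7, rho=0.8)
print(u_hat)
```

In this bivariate case, setting the derivative of the exponent to zero gives $\Phi^{-1}(F_y(y)) = \Phi^{-1}(F_x(x)) / \rho$, which the grid search should approximate. Recovering $\hat{y}$ itself would additionally require the inverse of the label's empirical CDF, which is omitted here.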

A Simple Meme Generation Pipeline
After training a nonparanormal model for ranking meme descriptions, we build the simple meme generation pipeline shown in Figure 4.
Given a test image, we disguise our client as Internet Explorer, and query Google's "Search By Image" reverse image search service (http://www.google.com/imghp/). By comparing the query image with all possible images with their captions in Google's database, a "Best Guess" of the keywords in the image is then revealed.
Using the extracted image keywords, we further query a TF-IDF based Lucene meme search engine, which we indexed with a large number of Web-crawled meme descriptions. After we obtain the candidate generations, we extract all the text and vision features that we described in Section 3.1. Finally, our nonparanormal model ranks all candidates, and selects the final generation with the highest posterior.
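The retrieval step of the pipeline can be approximated by a plain TF-IDF ranking, sketched below. This is a simplified stand-in rather than Lucene's actual scoring formula; the function name `tfidf_rank`, the whitespace tokenization, and the three-sentence corpus are hypothetical and only serve to show how image keywords select candidate descriptions before re-ranking.

```python
import math
from collections import Counter

def tfidf_rank(query_terms, documents):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for index, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = sum(
            term_freq[term] * math.log(num_docs / doc_freq[term])
            for term in query_terms
            if term in term_freq
        )
        scores.append((score, index))
    # Highest score first; ties are broken by original order
    ordered = sorted(scores, key=lambda pair: (-pair[0], pair[1]))
    return [documents[i] for _, i in ordered]

# Toy corpus (assumption) standing in for the indexed meme descriptions.
corpus = [
    "i can has cheezburger",
    "grumpy cat says no",
    "one does not simply walk into mordor",
]
candidates = tfidf_rank(["grumpy", "cat"], corpus)
print(candidates[0])
```

In the full pipeline, the top retrieved candidates would then be converted into text and vision features and re-ranked by the trained model.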

Datasets
We collected meme images and text descriptions from two popular meme websites. In the prediction experiment, we use 3,008 image-description pairs for training, and 526 image-description pairs for testing. In the generation experiment, we use 269,473 meme descriptions to index the meme search engine, and 50 randomly selected images for testing. During training, we convert the raw counts of popular votes into reciprocal ranks (e.g., the most popular text descriptions will all have a reciprocal rank of 1, and the n-th most popular one will have a score of 1/n).
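The vote-to-label conversion can be sketched as below. We assume dense ranking, so that descriptions with the same vote count share a rank, and we rank a single list of counts; whether ranks are computed per image or over the whole dataset is not stated here, and the function name and vote counts are hypothetical.

```python
def reciprocal_ranks(votes):
    """Map raw vote counts to 1/rank, where tied counts share a rank."""
    distinct = sorted(set(votes), reverse=True)
    rank_of = {count: position + 1 for position, count in enumerate(distinct)}
    return [1 / rank_of[count] for count in votes]

# Toy vote counts (assumption) for four candidate descriptions.
print(reciprocal_ranks([120, 45, 120, 7]))  # [1.0, 0.5, 1.0, 0.3333333333333333]
```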

Prediction Experiments
In the first experiment, we compare the proposed NPN with various baselines in a prediction task, since prior literature (Hodosh et al., 2013) also suggests using ranking-based evaluation for associating images with text descriptions. Throughout the experiment sections, we set $\ell = 10$ and $\delta = 80$ as the dropout hyperparameters, where $\ell$ is the number of dropout learners.

Baselines:
The baselines are standard squared-loss linear regression, linear kernel SVM, and non-linear (Gaussian) kernel SVM. In a recent empirical study (Fernández-Delgado et al., 2014) that evaluates 179 classifiers from 17 families on 121 UCI datasets, the authors find that Gaussian SVM is one of the top performing classifiers. We use the linear regression implementation from Matlab's Statistics Toolbox, and LibSVM (Chang and Lin, 2011) for training and testing the SVM models. The hyperparameter C in linear SVM, and the γ and C hyperparameters in Gaussian SVM, are tuned on the training set using 10-fold cross-validation.

Evaluation Metrics:
Spearman's correlation (Hogg and Craig, 1994) and Kendall's tau (Kendall, 1938) have been widely used in many real-valued prediction (regression) problems in NLP (Albrecht and Hwa, 2007; Yogatama et al., 2011), and here we use them to measure the quality of the predicted values $\hat{y}$ by comparing them to the vector of ground truth $y$. Kendall's tau is a nonparametric statistical metric that has been shown to be inexpensive, robust, and representation independent (Lapata, 2006). We use a paired two-tailed t-test to measure statistical significance.
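Both metrics compare the ordering of predictions with the ordering of the ground truth, as the sketch below shows. It is a naive quadratic-time illustration under our own assumptions (the tau-a variant, no tie correction in Spearman's formula, hypothetical function names and toy vectors), not the evaluation script used for the reported numbers.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x_i, y_i), (x_j, y_j) in combinations(zip(x, y), 2):
        sign = (x_i - x_j) * (y_i - y_j)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    num_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / num_pairs

def spearman_rho(x, y):
    """Spearman's rho for data without ties, from squared rank differences."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Toy vectors (assumption): ground-truth reciprocal ranks and predictions.
truth = [1.0, 0.5, 0.33, 0.25]
predicted = [0.9, 0.3, 0.4, 0.1]
print(kendall_tau(truth, predicted), spearman_rho(truth, predicted))
```

Here one of the six pairs is ordered differently in the two vectors, which lowers both scores below 1 without either metric depending on the actual magnitudes.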

Comparison with Various Baselines
The first two panels of Figure 5 show the learning curves of our system compared with the baselines. We see that, as the amount of training data increases, our approach clearly dominates all other methods by a large margin. Linear and Gaussian SVMs perform similarly, and perform well with only 25% of the training data, but their improvements are not large when the amount of training data increases.
In the last two panels of Figure 5, we increase the number of features, and compare various models. We see that the linear regression model overfits with 600 features, and that Gaussian SVM outperforms the linear SVM. We also see that our NPN model clearly outperforms all baselines by a large gap, and does not overfit.

Combination of Text and Vision
In Table 1, we systematically compare the contributions of each feature set. First, we see that bigram features clearly improve the performance on top of unigram features. Second, named entities are crucial for further boosting the performance. Third, adding the shallow part-of-speech features does not benefit all models, but the dependency triples are shown to be useful for all methods. Finally, we see that using semantic features helps increase performance in most cases, and combining text and vision features in our NPN framework doubles the performance for associating popular votes, meme images, and text descriptions.

The Effects of Dropout Training for Nonparanormals
As we mentioned before, because NPNs model the complex network of random variables, a key issue for training NPNs is to prevent the model from overfitting to the training data. So far, no prior work has investigated dropout training for regularizing the nonparanormals, or even copulas in general.
To empirically test the effects of dropout training for nonparanormals, in addition to our datasets, we also compare with the unregularized copula from Wang and Hua (2014) on predicting financial risks from earnings calls. Table 2 clearly suggests that dropout training for NPNs significantly improves the performance on various datasets.

Qualitative Analysis
Table 3 shows the top ranked text features that are highly correlated with popular votes. We see that the named entity features are useful: Paul Walker, UPS, Bruce Willis, Pencil Guy, and Amy Winehouse are recognized as entities in the meme dataset. Dependency triples, as a less-understood feature set, also perform well in this task. For example, xcomp(tell,mean) captures the dependency relation of the popular meme series "You mean to tell me...". Interestingly, the transitional dependency feature dep(when,but) plays an important role in the language of memes. The object of a preposition, such as pobj(vegas,in) and pobj(life,of), also made the list. Bigrams are shown to be important features as usual. For example, "Yo daw" is a popular meme based on rapper Xzibit's famous reality car show "Pimp My Ride", where the rapper customizes people's cars according to personal preferences. This viral meme follows the pattern (http://knowyourmeme.com/memes/xzibit-yo-dawg) of "Yo daw(g), I herd you like X (noun), so I put an X in your Y (noun) so you can W (verb) while you Z (verb)."
The use of pronouns, captured by frame semantics features, is associated with popular memes. We hypothesize that by using pronouns such as "i", "you", "we", and "they", the meme recalls personal experiences and emotions, and thus connects better with the audience. Finally, we see that the punctuation bigram "... :" is an important feature in the language of memes, and Web dialect such as "y" (why) also exhibits high correlation with the popular votes.

Generation Experiments
In this section, we investigate the performance of our meme generation system using 50 test meme images. To quantitatively evaluate our system, we compare with both unsupervised and supervised baselines. For the unsupervised baselines, we compare with a compact recurrent neural network language model (RNNLM) (Mikolov, 2012) trained on the 3,008 text descriptions of our meme training set, as well as a full RNNLM trained on a large meme corpus of 269K sentences. For the supervised baselines, all models are trained on the 3,008 training image-description pairs with labels. All these models can be viewed as different re-ranking methods for the retrieved candidate descriptions. We use the BLEU score (Papineni et al., 2002) as the evaluation metric, since the generation task can be viewed as translating raw images into sentences. Our NPN outperforms the second-best supervised baseline by 4.35 BLEU points, and obtains an improvement of 4.48 BLEU points over the full RNNLM, which is trained in an unsupervised fashion on a corpus that is ∼90 times larger. When breaking down the results, we see that our NPN's advantage lies in generating longer phrases, typically trigrams and four-grams, compared to the other models. This is very interesting, because generating high-quality long phrases is difficult, since the memes are often short.
We show some generation examples in Figure 6. In the left column, the reference memes are the ones with the top votes by the crowd. The first chemistry cat meme includes puns, the second forever alone meme includes a reference to the life simulation video game, while the last Batman meme has an interesting conversation. In the second column, we see that the memes generated by the full RNNLM model are short, which corresponds to the quantitative results in Table 4. In the third column, our NPN meme generator is able to generate longer descriptions. Interestingly, it also creates a pun for the chemistry cat meme. Our generation for the forever alone meme is also accurate.
In the Batman example, we show that the NPN model makes a sentence-image-mismatch type of error: although the generated sentence includes the entities Batman and Robin, as well as their slapping activity, it was originally created for the "overly attached girlfriend" meme.
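The BLEU metric used in this section is built on clipped n-gram precision, which also explains why the breakdown by n-gram order is informative for longer phrases. The sketch below computes that single ingredient for one candidate and one reference; it is not a full BLEU implementation (full BLEU combines several n-gram orders with a geometric mean and a brevity penalty), and the function names and example strings are our own.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Modified n-gram precision: candidate counts are clipped by reference counts."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

# Toy strings (assumption): repeating "the" is not rewarded beyond its
# single occurrence in the reference.
print(clipped_precision("the the the cat", "the cat sat", 1))  # 0.5
print(clipped_precision("the the the cat", "the cat sat", 2))
```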

Conclusions
In this paper, we study the language of memes by jointly learning the image, the description, and the popular votes. In particular, we propose a robust nonparanormal approach to transform all vision and text features into the cumulative distribution function space. By learning the stochastic dependencies, we show that our model significantly outperforms various competitive baselines in the prediction experiments. In addition, we propose a simple pipeline for generating memes from raw images, drawing on reverse image search and traditional information retrieval. Finally, we show that our model obtains significant BLEU point improvements over an unsupervised RNNLM baseline trained on a larger corpus, as well as over other strong supervised baselines.