Embedding Lexical Features via Tensor Decomposition for Small Sample Humor Recognition

We propose a novel tensor embedding method that can effectively extract lexical features for humor recognition. Specifically, we use word-word co-occurrence to encode the contextual content of documents, and then decompose the resulting tensor to obtain the corresponding vector representations. We show that this simple method can capture features of lexical humor effectively for continuous humor recognition. In particular, we achieve a distance of 0.887 on a global humor ranking task, comparable to the top performing systems from SemEval 2017 Task 6B (Potash et al., 2017) but without the need for any external training corpus. We further show that this approach is also beneficial for small sample humor recognition tasks through a semi-supervised label propagation procedure, which achieves about 0.7 accuracy on the 16000 One-Liners (Mihalcea and Strapparava, 2005) and Pun of the Day (Yang et al., 2015) humor classification datasets using only 10% of known labels.


Introduction
Recognizing humor automatically is an important step for natural human-computer interaction (Shahaf et al., 2015). While early works tend to frame humor recognition as a binary classification task (Mihalcea and Strapparava, 2005; Yang et al., 2015), the last few years have seen the emergence of humor recognition as a pairwise relative ranking task (Cattle and Ma, 2016; Shahaf et al., 2015). In addition to pairwise ranking, SemEval 2017 Task 6 also includes a global ranking subtask. However, the majority of submissions build global rankings using a series of pairwise comparisons (Potash et al., 2017). Only Yan and Pedersen (2017) attempt to predict global rankings directly, ranking documents inversely to their probability according to an n-gram language model. State-of-the-art humor recognition algorithms usually require a considerable amount of labeled training data to learn effective features (Yang et al., 2015). However, such data are difficult to obtain, especially fine-grained humor annotations. First, humor judgments differ from individual to individual, so collecting perceptually consistent human labels is expensive and time-consuming. Second, fine-grained degrees of humor add a further challenge. Therefore, small sample learning methods, or even unsupervised rule-based methods, merit investigation.
* Zhenjie Zhao and Andrew Cattle contributed equally to this work.
† E. Papalexakis was supported by a UCR-China collaboration grant by the Bourns College of Engineering at UCR, and by the National Science Foundation CDSE Grant no. OAC-1808591.
In this paper, considering the importance of lexical information for humor recognition (Radev et al., 2015), we propose a tensor decomposition method to capture the contextual nuances of a corpus. This allows us to model the lexical similarity of sentences regardless of the size of the corpus. In this way, we can rank the degree of humor effectively via lexical centrality (Radev et al., 2015), namely, by regarding the distance to the lexical center as an indicator of the degree of humor. Experimental results on the SemEval 2017 Task 6 dataset (Potash et al., 2017) show that, without external training data, the tensor embedding method can achieve performance equivalent to second place on SemEval 2017 Task 6B. In addition, by applying a semi-supervised label propagation procedure (Zhou et al., 2003), we can also use the tensor embedding method for small sample humor recognition, achieving about 0.7 accuracy with only 10% of known labels on the 16000 One-Liners (Mihalcea and Strapparava, 2005) and Pun of the Day (Yang et al., 2015) datasets.
The contributions of this paper are: 1) we propose a tensor embedding method to model the lexical features of documents, which can capture lexical similarity effectively regardless of the size of the corpus; 2) we show that these lexical features can be used effectively for fine-grained humor ranking and small sample humor recognition. Our implementation is open-sourced and can be found at https://github.com/zhaozj89/TensorEmbeddingNLP.

Humor Feature Extraction
Modeling and learning humor features are critical for automatic humor recognition. Previous works tend to use a combination of phonological, stylistic, semantic, and content-based features. Phonological features include acoustic features extracted from sitcom audio tracks (Bertero and Fung, 2016) and "phonetic embeddings" generated using a character-to-phoneme LSTM encoder-decoder (Donahue et al., 2017). Stylistic features include alliteration, rhyming, negative sentiment, and adult slang (Mihalcea and Strapparava, 2005) as well as emotional scenarios (Reyes et al., 2012). Semantic features range from attempts to measure incongruity (Cattle and Ma, 2018; Shahaf et al., 2015; Yang et al., 2015) to the use of word embeddings as inputs to neural models (Bertero and Fung, 2016; Donahue et al., 2017). Content-based approaches include word frequency (Mihalcea and Strapparava, 2005), n-gram probability (Yan and Pedersen, 2017), and lexical centrality (Radev et al., 2015).
Centrality is based on the observation that humorous responses to common stimuli tend to cluster around a small number of core jokes (Radev et al., 2015; Shahaf et al., 2015), with more central documents benefiting from the "wisdom of the crowd". While most humor features involve making population-level inferences based on document-level features, centrality is instead a population-level feature in its own right. Radev et al. (2015) calculate their centrality feature using LexRank, a graph-based text summarization method (Erkan and Radev, 2004). Compared with more traditional lexical similarity measures like tf-idf, which yield sparse vector representations for short texts, this method is better suited to short humor texts (Erkan and Radev, 2004).

Small Sample Humor Recognition
Once the humor features have been extracted, the next step is training a machine learning model to make predictions. Although learning-based methods have shown significant performance improvements recently (Yang et al., 2015), one of their main bottlenecks is the lack of appropriate training corpora. While previous works have employed data crawled from websites (Mihalcea and Strapparava, 2005; Yang et al., 2015), Twitter (Cattle and Ma, 2016; Reyes et al., 2012), sitcom subtitles (Bertero and Fung, 2016; Purandare and Litman, 2006), or the New Yorker Cartoon Caption Contest (Radev et al., 2015; Shahaf et al., 2015), these datasets are generally not released publicly. Owing to the difficulty in obtaining fine-grained labeled humor data, it is critical to study how to recognize humor from a small training sample or even without labeled data.

Tensor Embedding
Contextual patterns of words can be used to measure lexical similarity for humor recognition. State-of-the-art learning-based approaches like doc2vec (Le and Mikolov, 2014) or sent2vec (Pagliardini et al., 2018) usually require a large amount of data, which is difficult to obtain for humor recognition. We instead propose a tensor decomposition method to obtain lexical features of short humor texts. The tensor is built from word-word co-occurrence counts, a representation that has previously been explored only in the context of fake news detection (Hosseinimotlagh and Papalexakis, 2018). Consider a corpus $\mathcal{D} = \{s_1, s_2, \ldots, s_D\}$ with $D$ sentences. We first build a vocabulary $w_1, w_2, \ldots, w_V$, where $V$ is the number of words. For each sentence $s$ in $\mathcal{D}$, we count the word-word co-occurrences within a small window $H$ and build a frequency matrix $W_s \in \mathbb{Z}^{V \times V}$, where $\mathbb{Z}$ denotes the set of integers. In particular, $W_s(i, j)$ is the number of times words $w_i$ and $w_j$ co-occur in $s$ within the window $H$. In this way, we capture the lexical patterns of $s$ in $W_s$. We then stack all $W_s$ into a three-dimensional tensor $W \in \mathbb{Z}^{V \times V \times D}$. The objective of tensor decomposition is to find an approximation $\hat{W}$ of $W$ such that:

$$\hat{W} = \sum_{r=1}^{R} v_r \otimes v_r \otimes d_r, \tag{1}$$

where $v_r \in \mathbb{R}^V$, $d_r \in \mathbb{R}^D$, $R$ is the predefined rank parameter, and $\otimes$ is the outer product, namely, $v_r \otimes v_r \otimes d_r$ is a three-dimensional tensor with entries

$$(v_r \otimes v_r \otimes d_r)(i, j, k) = v_r(i)\, v_r(j)\, d_r(k). \tag{2}$$

With the tensor decomposition, we can find low-rank embeddings of sentences that capture the similarity of contextual patterns (Hosseinimotlagh and Papalexakis, 2018). In particular, $C = [d_1, d_2, \ldots, d_R] \in \mathbb{R}^{D \times R}$, where the $s$-th row of $C$ is the embedding vector of sentence $s$. The Euclidean distance between embeddings is used to measure the similarity of two sentences.

Table 2: The results of our label propagation system. XX% refers to XX% of the data used as training and (100-XX)% as test. Baseline results reproduced from Yang et al. (2015).
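
The co-occurrence tensor and its decomposition described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the released implementation: the function names (`cooccurrence_tensor`, `khatri_rao`, `cp_als`), the toy sentences, and the exact window rule (each word is paired with the next `window` words) are hypothetical, and the decomposition is a plain three-factor CP alternating least squares that does not enforce the shared factor $v_r$ in the first two modes. The third factor matrix plays the role of the sentence embedding matrix $C$.

```python
import numpy as np


def cooccurrence_tensor(sentences, window=5):
    """Stack one V x V co-occurrence count matrix per sentence into a V x V x D tensor."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    W = np.zeros((len(vocab), len(vocab), len(sentences)))
    for k, s in enumerate(sentences):
        for p, w in enumerate(s):
            # Assumed window rule: pair each word with the next `window` words.
            for q in range(p + 1, min(p + window + 1, len(s))):
                i, j = index[w], index[s[q]]
                W[i, j, k] += 1
                W[j, i, k] += 1  # count both orders so every slice is symmetric
    return W, vocab


def khatri_rao(X, Y):
    """Column-wise Kronecker product: (I x R), (J x R) -> (I*J x R)."""
    return np.einsum("ir,jr->ijr", X, Y).reshape(-1, X.shape[1])


def cp_als(T, rank, n_iter=50, seed=0):
    """Plain CP decomposition by alternating least squares (three free factors)."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((n, rank)) for n in T.shape]
    for _ in range(n_iter):
        for mode in range(3):
            a, b = [factors[m] for m in range(3) if m != mode]
            unfolded = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
            gram = (a.T @ a) * (b.T @ b)
            # Least-squares update of one factor with the other two held fixed.
            factors[mode] = unfolded @ khatri_rao(a, b) @ np.linalg.pinv(gram)
    return factors


# Toy corpus, made up for illustration only.
sentences = [s.split() for s in [
    "why did the chicken cross the road",
    "the chicken crossed the road to get to the other side",
    "a horse walks into a bar",
]]
W, vocab = cooccurrence_tensor(sentences, window=5)
A, B, C = cp_als(W, rank=2)           # rows of C are the sentence embeddings
W_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(W.shape, C.shape, rel_err)
```

Row `k` of `C` is the embedding of sentence `k`, so sentence similarity reduces to Euclidean distances between rows. A dense tensor is used here only for brevity; a realistic vocabulary would call for a sparse representation.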

Lexical Centrality
We use lexical centrality to rank the degree of humor (Radev et al., 2015). While Radev et al. (2015) utilize a graph-based definition of centrality, we instead take a vector-space approach. Given the decomposed $C = [c_1^T, c_2^T, \ldots, c_D^T]^T$, we compute the centroid $m$ as the average of all sentence vectors of a corpus:

$$m = \frac{1}{D} \sum_{i=1}^{D} c_i. \tag{3}$$

The Euclidean distance to the center is then taken as an indicator of the degree of humor. In other words, given two sentences $s_1$ and $s_2$ and their embeddings $x_1$ and $x_2$, $d(m, x_1) < d(m, x_2)$ implies that $s_1$ is funnier than $s_2$.
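
The centroid-based ranking can be illustrated with a minimal sketch. The function name `rank_by_centrality` and the toy embedding matrix are assumptions made for illustration; in practice the rows would be the sentence embeddings obtained from the tensor decomposition.

```python
import numpy as np


def rank_by_centrality(C):
    """Return sentence indices ordered from most to least central, plus the distances."""
    m = C.mean(axis=0)                     # centroid of all sentence embeddings
    dist = np.linalg.norm(C - m, axis=1)   # Euclidean distance of each row to m
    return np.argsort(dist, kind="stable"), dist


# Toy embeddings (made up for illustration): four sentences, rank R = 2.
C = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
order, dist = rank_by_centrality(C)
print(order)  # indices sorted from predicted funniest to least funny
```

Sentences closest to the centroid come first in `order`, matching the rule that a smaller distance to the center indicates a funnier sentence.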

Label Propagation
With the lexical similarity captured by tensor embeddings, we can build a similarity graph and use a label propagation algorithm (Zhou et al., 2003) for semi-supervised humor recognition. In this way, we can use only a small portion of labeled data to predict the remaining unlabeled data effectively (Zhou et al., 2003). In particular, with the tensor embeddings, we first find the $K$ nearest neighbors of each data point and build a similarity graph $G$. We then form an affinity matrix $W$, where $W_{ij} = 1$ if $i$ and $j$ are connected and $W_{ij} = 0$ otherwise. Afterwards, we iterate:

$$F(t+1) = \alpha S F(t) + (1 - \alpha) Y, \tag{4}$$

where, following Zhou et al. (2003), $S$ is the symmetrically normalized affinity matrix and $Y$ is the matrix of initial labels. The result $F^*$ is obtained as the limit of this sequence. Equation (4) means that we propagate the labels of each data point to its neighbors in a weighted average way, where $\alpha$ is the ratio of labels propagated in each iteration. For each point $x_i$, its label is $y_i = \arg\max_{j \le c} F^*_{ij}$, where $c$ is the number of classes.
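
A minimal sketch of this procedure follows, written against the standard formulation of Zhou et al. (2003). The function names `knn_affinity` and `propagate`, the symmetrization rule for the K-nearest-neighbor graph (an edge exists if either point lists the other as a neighbor), and the toy one-dimensional data are assumptions made for illustration and are not taken from the released code.

```python
import numpy as np


def knn_affinity(X, k):
    """Binary, symmetrized K-nearest-neighbor affinity matrix."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbor
    nearest = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((len(X), len(X)))
    W[np.repeat(np.arange(len(X)), k), nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                 # connect i and j if either lists the other


def propagate(W, Y, alpha=0.2, n_iter=200):
    """Iterate F <- alpha * S F + (1 - alpha) * Y starting from F(0) = 0."""
    # Every node has at least k >= 1 neighbors, so the degrees are nonzero.
    inv_sqrt_deg = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]   # Deg^{-1/2} W Deg^{-1/2}
    F = np.zeros_like(Y, dtype=float)
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F


# Toy data (made up): two well-separated clusters, one labeled point in each.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
Y = np.zeros((6, 2))
Y[0, 0] = 1.0   # point 0 is known to belong to class 0
Y[3, 1] = 1.0   # point 3 is known to belong to class 1
F = propagate(knn_affinity(X, k=2), Y)
labels = F.argmax(axis=1)
print(labels)
```

As shown by Zhou et al. (2003), for $0 < \alpha < 1$ the iteration converges to the closed form $F^* = (1 - \alpha)(I - \alpha S)^{-1} Y$, so for small graphs the loop could be replaced by a single linear solve.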

Experiment
To evaluate the effectiveness of the tensor embedding method, we conduct two separate experiments, on global humor ranking and on binary humor classification. The alternating least squares method for CANDECOMP/PARAFAC tensor decomposition (Sidiropoulos et al., 2017) is used to calculate the low-rank sentence embeddings, as implemented in the Matlab Tensor Toolbox (www.tensortoolbox.org).

Global Humor Ranking
To show the effectiveness of lexical centrality computed from our tensor embeddings, we conduct an experiment on SemEval 2017 Task 6B (Potash et al., 2017), which consists of tweeted responses to specific thematic prompts generated as part of a TV show. For each prompt, the writing staff of the show pick a top 10 and an overall winner. These humor judgments are used as gold standard labels. Tensor embeddings and centroids are computed on a per-prompt basis, and responses are ranked according to their distance from the centroid. A grid search procedure selects a rank of 100 and a window size of 5. For evaluation, we adopt the same edit distance-based metric used in Potash et al. (2017).
The results of our lexical centrality system using tensor embeddings are shown in Table 1, where the official results of other state-of-the-art systems are taken from Potash et al. (2017). Our system outperforms all but the Duluth system (Yan and Pedersen, 2017) in the official results for SemEval 2017 Task 6B (Potash et al., 2017), making our performance equivalent to second place. Notably, our system performs well on the Broadway prompt, where other methods usually fail. Moreover, because our system does not have a learning procedure, its performance is more stable than that of the other systems.

Binary Humor Classification
To show the effectiveness of label propagation over our tensor embeddings for small sample humor recognition, we conduct an experiment on two humor classification datasets: 16000 One-Liners (Mihalcea and Strapparava, 2005) and Pun of the Day (Yang et al., 2015). As before, we run a grid search procedure to find the optimal parameters, setting the rank to 10, the window size to 5, the number of neighbors to 50, and α to 0.2. F(0) is initialized as a zero matrix. For each dataset, we randomly select 5%, 10%, 30%, and 90% of the data for training. We run a 10-fold procedure and report the average accuracy, precision, recall, and F1 score values.
The results of humor classification are shown in Table 2. Our own implementation of Yang et al. (2015) is included as a baseline. While Yang et al. (2015) use a large portion of the data for training and combine different features, we find that with a similar portion of training data (90%), the results of our method are comparable to theirs. In addition, with only a small portion of training data, our method still achieves good results.

Lexical Centrality
The most notable aspect of our tensor embedding/lexical centrality approach is how little training data our system requires. Our system's unsupervised nature means that we do not need to use the 106 training prompts included with the SemEval 2017 Task 6 dataset. Our results are obtained exclusively using the six evaluation prompts. By comparison, almost all the systems reported in Potash et al. (2017) take a supervised approach and make full use of the training set. Furthermore, since we consider prompts one at a time and each prompt contains only approximately 100 responses, we are able to achieve state-of-the-art performance with roughly 100 documents per prompt. The only system reported in Potash et al. (2017) to take an unsupervised approach is Duluth (Yan and Pedersen, 2017). Like ours, their results are obtained without using the training set. However, their system uses an n-gram language model trained on a 6.2GB subset of the News Commentary Corpus and the News Crawl Corpus. Similarly, most other systems use some form of external training corpora for training word embeddings, phoneme models, semantic models, and so on.
Another advantage of our approach is its ease of interpretation, in contrast to neural state-of-the-art baselines. Because our lexical feature lives in a Euclidean space, we can compare and rank humor levels more easily. Tweets labeled as "overall winners" exhibited a smaller mean distance from their respective centroids (0.848) than those labeled as "merely in the top 10" (0.942). These tweets in turn exhibited smaller distances than those labeled as "not in the top 10" (1.00). A one-way ANOVA test gives only mild evidence that overall winners are drawn from a different distribution than tweets not in the top 10 (p = 0.106). This weak result is likely due to the fuzzy nature of humor and the relatively small dataset. Finally, ad hoc analysis of tweets with distances > 2 revealed these to be mostly "not in the top 10".

Label Propagation
Although the semi-supervised framework provides a good alternative for small sample humor recognition, our method still cannot achieve state-of-the-art performance with the same portion of training data. There is still room to improve the method, for example by modeling not only lexical similarity but also other features that are important for humor recognition, such as word association (Cattle and Ma, 2016). In addition, label propagation cannot handle unbalanced data well. Incorporating prior knowledge of the label ratio, e.g., for the unbalanced SemEval 2017 Task 6 dataset, also deserves further investigation.

Conclusion
In this paper, we show the importance of lexical features for small sample humor recognition. We propose a tensor embedding method to capture lexical similarity effectively. Without training data, we achieve a result equivalent to second place on SemEval 2017 Task 6B. Under a semi-supervised framework, the tensor embedding method also achieves good results for small sample humor classification. An interesting direction for future work is a unified tensor embedding model that combines not only lexical features but also other features that are important for the sense of humor.