Video Highlights Detection and Summarization with Lag-Calibration based on Concept-Emotion Mapping of Crowd-sourced Time-Sync Comments

With the prevalence of video sharing, there are increasing demands for automatic video digestion such as highlight detection. Recently, platforms with crowdsourced time-sync video comments have emerged worldwide, providing a good opportunity for highlight detection. However, this task is non-trivial: (1) time-sync comments often lag behind their corresponding shot; (2) time-sync comments are semantically sparse and noisy; (3) which shots count as highlights is highly subjective. The present paper tackles these challenges by proposing a framework that (1) uses concept-mapped lexical chains for lag calibration; (2) models video highlights based on comment intensity and the combined emotion and concept concentration of each shot; (3) summarizes each detected highlight using an improved SumBasic with emotion and concept mapping. Experiments on large real-world datasets show that both our highlight detection method and our summarization method outperform other benchmarks by considerable margins.


Introduction
Every day, people watch billions of hours of video on YouTube, with half of the views on mobile devices (https://www.youtube.com/yt/press/statistics.html). With the prevalence of video sharing, there is increasing demand for fast video digestion. Imagine a scenario where a user wants to quickly grasp a long video without repeatedly dragging the progress bar to skip unappealing shots. With automatically generated highlights, users could digest the entire video in minutes before deciding whether to watch the full video later. Moreover, automatic video highlight detection and summarization could benefit video indexing, video search and video recommendation.
However, finding highlights in a video is not a trivial task. First, what counts as a "highlight" can be very subjective. Second, a highlight may not always be captured by analyzing low-level features of image, audio and motion. The lack of abstract semantic information has become a bottleneck for highlight detection in traditional video processing.
Recently, crowdsourced time-sync video comments, or "bullet-screen comments", have emerged: comments generated in real time fly over or beside the screen, synchronized with the video frame by frame. Such platforms have gained popularity worldwide, including niconico in Japan, Bilibili and Acfun in China, and YouTube Live and Twitch in the USA. The popularity of time-sync comments suggests new opportunities for video highlight detection based on natural language processing.
Nevertheless, it is still a challenge to detect and label highlights using time-sync comments. First, comments almost inevitably lag behind the shot they refer to. As in Figure 1, an ongoing discussion about one shot may extend into the next few shots. Highlight detection and labeling without lag-calibration may produce inaccurate results. Second, time-sync comments are semantically sparse, both in the number of comments per shot and the number of tokens per comment. Traditional bag-of-words statistical models may work poorly on such data.
Third, there is much uncertainty in highlight detection in an unsupervised setting without any prior knowledge. The characteristics of highlights must be explicitly defined, captured and modeled.
To the best of our knowledge, little work has concentrated on highlight detection and labeling based on time-sync comments in an unsupervised way. The most relevant work detects highlights based on the topic concentration of semantic vectors of bullet comments, and labels each highlight with a pre-trained classifier over pre-defined tags (Lv, Xu, Chen, Liu, & Zheng, 2016). Nevertheless, we argue that emotion concentration is more important for highlight detection than general topic concentration. Another work extracts highlights based on frame-by-frame similarity of emotion distributions (Xian, Li, Zhang, & Liao, 2015). However, neither work tackles lag-calibration, the balance between emotion and topic concentration, and unsupervised highlight labeling simultaneously.
To solve these problems, the present study proposes: (1) word-to-concept and word-to-emotion mapping based on global word embedding, from which lexical chains are constructed for bullet-comment lag-calibration; (2) highlight detection based on the emotional and conceptual concentration and intensity of lag-calibrated bullet comments; (3) highlight summarization with a modified SumBasic algorithm that treats emotions and concepts as the basic units of a bullet comment.
The main contributions of the present paper are as follows: (1) we propose an entirely unsupervised framework for video highlight detection and summarization based on time-sync comments; (2) we develop a lag-calibration technique based on concept-mapped lexical chains; (3) we construct large datasets for bullet-comment word embedding, a bullet-comment emotion lexicon, and ground truth for evaluating highlight detection and labeling based on bullet comments.

Highlight detection by video processing
First, following the definition in previous work (M. Xu, Jin, Luo, & Duan, 2008), we define highlights as the most memorable shots in a video, with high emotional intensity. Note that highlight detection differs from video summarization, which focuses on a condensed representation of a video's storyline rather than on extracting affective content (K.-S. Lin, Lee, Yang, Lee, & Chen, 2013).
For highlight detection, some researchers represent emotions in a video as a curve on the arousal-valence plane, using low-level features such as motion, vocal effects, shot length and audio pitch (Hanjalic & Xu, 2005), color (Ngo, Ma, & Zhang, 2005), or mid-level features such as laughter and subtitles (M. Xu, Luo, Jin, & Park, 2009). Nevertheless, due to the semantic gap between low-level features and high-level semantics, the accuracy of highlight detection based on video processing is limited (K.-S. Lin et al., 2013).
The present study models highlight detection as a simple two-objective optimization problem with constraints. However, the features chosen to evaluate the "highlightness" of a shot differ from those in the studies above. Because a highlight shot is observed to correlate with high emotional intensity and topic concentration, coverage and non-redundancy are no longer optimization goals, unlike in temporal text summarization. Instead, we focus on modeling emotional and topical concentration.

Crowdsourced time-sync comment mining
Several works have focused on tagging videos shot-by-shot with crowdsourced time-sync comments, via manual labeling and supervised training (Ikeda, Kobayashi, Sakaji, & Masuyama, 2015), temporal and personalized topic modeling (Wu, Zhong, Tan, Horner, & Yang, 2014), or tagging the video as a whole (Sakaji, Kohana, Kobayashi, & Sakai, 2016). One work generates a summary of each shot by data reconstruction jointly at the textual and topic level (L. Xu & Zhang, 2017). Another proposes a centroid-diffusion algorithm to detect highlights (Xian et al., 2015), with shots represented by latent topics from LDA. Yet another uses pre-trained semantic vectors of comments to cluster them into topics, and finds highlights based on topic concentration (Lv et al., 2016); predefined labels are then used to train a classifier for highlight labeling. The present study differs from these two studies in several aspects. First, before highlight detection, we perform lag-calibration to minimize inaccuracy due to comment lags. Second, we represent each scene by a combination of topic and emotion concentration. Third, we perform both highlight detection and highlight labeling in an unsupervised way.

Lexical chain
A lexical chain is a sequence of words in a cohesive relationship spanning a range of sentences. Early work constructed lexical chains from syntactic relations of words using Roget's Thesaurus, without word sense disambiguation (Morris & Hirst, 1991). Later work expanded lexical chains with WordNet relations and word sense disambiguation (Barzilay & Elhadad, 1999; Hirst & St-Onge, 1998). Lexical chains have also been constructed from word-embedding relations for the disambiguation of multi-words (Ehren, 2017). The present study constructs lexical chains for lag-calibration based on global word embedding.

Problem Formulation
The problem in the present paper can be formulated as follows. The input is a set of time-sync comments C = {c_1, c_2, c_3, ..., c_n} with a set of timestamps T = {t_1, t_2, t_3, ..., t_n} of a video v, a compression ratio ρ_highlight for the number of highlights to be generated, and a compression ratio ρ_summary for the number of comments in each highlight summary. Our task is to (1) generate a set of highlight shots H(v) = {h_1, h_2, h_3, ..., h_m}, and (2) highlight summaries A = {a_1, a_2, a_3, ..., a_m} as close to the ground truth as possible. Each highlight summary a_i comprises a subset of all the comments in its shot: a_i = {c_1, c_2, c_3, ..., c_k}. The number of highlight shots and the number of comments in each summary a_i are determined by ρ_highlight and ρ_summary respectively.

Video Highlight Detection
In this section, we introduce our framework for highlight detection, along with two preliminary tasks: the construction of a global time-sync comment word embedding and of an emotion lexicon.

Word-Embedding of Time-Sync Comments
As pointed out earlier, one challenge in analyzing time-sync comments is semantic sparseness, since both the number of comments per shot and the length of each comment are very limited. Two semantically related words may appear unrelated if they do not co-occur frequently within one video. To compensate, we construct a global word embedding on a large collection of time-sync comments.

Emotion Lexicon Construction
As emphasized earlier, it is crucial to extract emotions from time-sync comments for highlight detection. However, traditional emotion lexicons cannot be used here, since these platforms breed many Internet slang terms of their own. For example, "23333" means "ha ha ha", and "6666" means "really awesome". Therefore, we construct an emotion lexicon tailored for time-sync comments from the word-embedding dictionary trained in the last step. First, we manually label words of the five basic emotional categories (happiness, anger, sadness, fear and surprise) as seeds (Ekman, 1992), drawn from the most frequent words in the corpus. The sixth emotion category, "disgust", is omitted because it is relatively rare in our dataset; it could readily be incorporated for other datasets. Then we expand the emotion lexicon by searching the top neighbors of each seed word in the word-embedding space, adding a neighbor to the seeds if it overlaps with at least a ratio φ_overlap of the existing seeds at a minimum similarity of sim_min. Neighbors are searched by cosine similarity in the word-embedding space.
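The expansion loop described above can be sketched as follows. Here `neighbors_fn`, the overlap ratio `phi_overlap` and `sim_min` follow the paper's description, but the exact form of the overlap test and the fixed number of rounds are assumptions for illustration:

```python
from collections import defaultdict

def expand_lexicon(seeds, neighbors_fn, phi_overlap=0.05, sim_min=0.6, rounds=3):
    """Iteratively grow an emotion lexicon from seed words.

    neighbors_fn(word) -> list of (neighbor, cosine_similarity), assumed to
    come from a word2vec model trained on time-sync comments.  A neighbor is
    adopted when it is close (>= sim_min) to at least a phi_overlap fraction
    of the current seed set.
    """
    lexicon = set(seeds)
    for _ in range(rounds):
        candidates = defaultdict(int)  # neighbor -> number of seeds it is close to
        for seed in list(lexicon):
            for word, sim in neighbors_fn(seed):
                if sim >= sim_min and word not in lexicon:
                    candidates[word] += 1
        for word, count in candidates.items():
            if count / len(lexicon) >= phi_overlap:
                lexicon.add(word)
    return lexicon
```

In practice each round would be followed by the manual filtering step described in the experiments, so that inaccurate expansions do not cause concept drift.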

Lag-Calibration
In this section, we introduce our method for lag-calibration, following the steps of concept mapping, word-embedded lexical chain construction, and lag-calibration.

Concept Mapping
To tackle the semantic sparseness of time-sync comments, and to construct lexical chains of semantically related words, words with similar meanings should first be mapped to the same concept. Given a set of comments C of video v, we propose a mapping ℱ from the vocabulary W of the comments to a set of concepts E. More specifically, ℱ maps each word w_i into a concept e = ℱ(w_i), where top_n(w_i) returns the top n neighbors of word w_i by cosine similarity. For every word w_i in a comment, we check the percentage of its neighbors already mapped to an existing concept e. If the percentage exceeds the threshold φ_overlap, then word w_i together with its neighbors is mapped to e. Otherwise they are mapped to a new concept e_i.
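A minimal sketch of this greedy word-to-concept mapping follows. The function and variable names are ours; in particular, using the first word of a group as the concept identifier and pulling unmapped neighbors into the winning concept are assumptions about details the paper leaves open:

```python
def map_to_concepts(words, neighbors_fn, phi_overlap=0.5, top_n=15):
    """Greedy word-to-concept mapping (a sketch of the mapping F).

    Each word is merged into an existing concept when at least phi_overlap
    of its top_n embedding neighbors already belong to that concept;
    otherwise it founds a new concept named after itself.
    """
    word2concept = {}
    for w in words:
        if w in word2concept:
            continue
        neigh = [n for n, _ in neighbors_fn(w)[:top_n]]
        votes = {}  # existing concept -> how many neighbors belong to it
        for n in neigh:
            if n in word2concept:
                c = word2concept[n]
                votes[c] = votes.get(c, 0) + 1
        best = max(votes, key=votes.get) if votes else None
        if best is not None and votes[best] / max(len(neigh), 1) >= phi_overlap:
            word2concept[w] = best
        else:
            word2concept[w] = w  # w starts a new concept
        for n in neigh:          # unmapped neighbors join w's concept
            word2concept.setdefault(n, word2concept[w])
    return word2concept
```

Because comments are processed in their natural temporal order, temporally close mentions of related words tend to land in the same concept, which is what the lexical-chain step relies on.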

Lexical Chain Construction
The next step is to construct all lexical chains in the current time-sync comments of video v, so that lagged comments can be calibrated based on these chains. A lexical chain l_ij comprises a set of triples (w, c, t), where w is the actually mentioned word of concept e_i in comment c, and t is the timestamp of comment c. A lexical chain dictionary D_chain for the time-sync comments of video v is: D_chain = {e_1: (l_11, l_12, l_13, ...), e_2: (l_21, l_22, l_23, ...), ..., e_k: (l_k1, l_k2, l_k3, ...)}, where e_i ∈ E is a concept and l_ij is the j-th lexical chain of concept e_i. The algorithm for lexical chain construction is described in Algorithm 1.
Specifically, each comment in C is either appended to an existing lexical chain or starts a new one, depending on its temporal distance from existing chains, controlled by the maximum silence l_max.
Note that word senses in the lexical chains constructed here are not disambiguated, as most traditional algorithms do. Nevertheless, we argue that the lexical chains are still useful, since our concept mapping is built from time-sync comments in their natural order, whose progressive semantic continuity naturally reinforces similar word senses for temporally close comments. This semantic continuity, together with global word embedding, ensures that our concept mapping is valid in most cases.
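The chain-construction step (Algorithm 1) can be sketched as below; the data layout (a list of chains per concept, each chain a list of `(word, timestamp)` mentions) is our assumption about how the dictionary D_chain is organized:

```python
def build_lexical_chains(comments, word2concept, l_max=11.0):
    """Sketch of Algorithm 1: group concept mentions into lexical chains.

    comments: list of (timestamp, [tokens]) in temporal order.
    A mention is appended to the latest chain of its concept if the gap
    since that chain's last mention is at most l_max seconds; otherwise
    a new chain is started for the concept.
    """
    chains = {}  # concept -> list of chains; each chain is [(word, ts), ...]
    for ts, tokens in comments:
        for w in tokens:
            concept = word2concept.get(w, w)
            concept_chains = chains.setdefault(concept, [])
            if concept_chains and ts - concept_chains[-1][-1][1] <= l_max:
                concept_chains[-1].append((w, ts))
            else:
                concept_chains.append([(w, ts)])
    return chains
```

With l_max = 11 seconds (the value used in the experiments), a gap of silence longer than 11 seconds cuts the chain, which controls chain compactness.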

Comment Lag-Calibration
Given the constructed lexical chain dictionary D_chain, we can calibrate the comments in C based on their lexical chains. From our observation, the first comment about a shot usually occurs within the shot, while the rest may not. Therefore, we calibrate the timestamp of each comment to the timestamp of the first element of the lexical chain it belongs to. Among all the lexical chains (concepts) a comment belongs to, we pick the one with the highest score s(l), computed as the summed frequency of each word in the chain weighted by its logarithmic global frequency log f(w).

Algorithm 1 Lexical Chain Construction
Input: time-sync comments C; word-to-concept mapping ℱ; maximum silence l_max. Output: a dictionary of lexical chains D_chain.

In this way, each comment is assigned to its most semantically important lexical chain (concept) for calibration. The calibration algorithm is described in Algorithm 2.
Note that if there are multiple consecutive shots {s_1, s_2, ..., s_m} with comments of similar content, our lag-calibration may shift many comments in shots s_2, s_3, ..., s_m to the timestamp of the first shot s_1, if these comments are connected via lexical chains from shot s_1. This is not necessarily a bad thing, since we hope to avoid selecting redundant consecutive highlight shots and to leave room for other candidate highlights, given a fixed compression ratio.
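A sketch of the calibration step (Algorithm 2) under the above description. The chain-scoring formula `log(1 + f(w))` is our reading of "weighted by its logarithmic global frequency", and restricting candidate chains to those whose time span covers the comment is an assumption:

```python
import math

def calibrate(comments, chains, global_freq):
    """Shift each comment back to the start of the most important
    lexical chain it participates in.

    comments: list of (comment_id, timestamp, [tokens]).
    chains: concept -> list of chains, each a list of (word, ts),
    as produced by a chain-construction step.
    """
    calibrated = {}
    for cid, ts, tokens in comments:
        best_score, best_start = 0.0, ts
        for concept_chains in chains.values():
            for chain in concept_chains:
                times = [t for _, t in chain]
                if not (min(times) <= ts <= max(times)):
                    continue  # comment lies outside this chain's span
                if not any(w in tokens for w, _ in chain):
                    continue  # comment shares no word with this chain
                score = sum(math.log(1 + global_freq.get(w, 1)) for w, _ in chain)
                if score > best_score:
                    best_score, best_start = score, min(times)
        calibrated[cid] = best_start  # timestamp of the chain's first element
    return calibrated
```

Comments that belong to no chain keep their original timestamps, matching the observation that only lagged follow-up comments need shifting.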

Shot Importance Scoring
In this section, we first segment the comments into shots of equal temporal length l_shot, then model shot importance. Highlights can then be detected based on shot importance.
A shot's importance is modeled as being driven by two factors: comment concentration and commenting intensity. For comment concentration, as mentioned earlier, both concept and emotion concentration may contribute to highlight detection. For example, a group of concept-concentrated comments like "the background music/bgm/soundtrack of this shot is classic/inspiring/the best" may indicate a highlight built around memorable background music, while comments such as "this plot is so funny/hilarious/lmao/lol/2333" may suggest a single-emotion-concentrated highlight. Therefore, we combine the two concentrations in our model. First, we define the emotional concentration C_emotion(s) of shot s, based on its time-sync comments C_s and the emotion lexicon, as the inverse of the entropy of the probabilities of the five emotions within the shot. We then define the topical concentration C_topic(s) analogously as the inverse of the entropy over all concepts within the shot, where the probability of each concept is the summed frequency of its mentioned words weighted by their global frequencies, divided by the same quantity over all words in the shot. The comment importance ℐ_comment(C_s) of shot s is then a λ-weighted combination of the two, ℐ_comment(C_s) = λ · C_emotion(s) + (1 − λ) · C_topic(s), where λ is a hyper-parameter controlling the balance between emotion and concept concentration. Finally, we define the overall importance of shot s as ℐ(s) = ℐ_comment(C_s) · log(L_s), where L_s is the total length of all time-sync comments in shot s, a straightforward yet effective indicator of comment intensity. Highlight detection can now be modeled as a maximization problem: select the shots with the highest importance scores, subject to the compression ratio ρ_highlight.
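The scoring can be sketched as follows. Reading "inverse of entropy" as `1 / (1 + H)` (to avoid division by zero for perfectly concentrated shots) and using the token count as a proxy for total comment length are our assumptions:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def shot_importance(emotion_counts, concept_counts, lam=0.9):
    """Sketch of the shot-scoring step.

    emotion_counts / concept_counts: occurrence counts per emotion /
    per concept within the shot.  lam balances emotion vs. concept
    concentration; the comment score is scaled by log comment volume.
    """
    total_e = sum(emotion_counts.values()) or 1
    total_c = sum(concept_counts.values()) or 1
    c_emotion = 1.0 / (1.0 + entropy([n / total_e for n in emotion_counts.values()]))
    c_topic = 1.0 / (1.0 + entropy([n / total_c for n in concept_counts.values()]))
    i_comment = lam * c_emotion + (1 - lam) * c_topic
    return i_comment * math.log(1 + total_c)  # intensity factor
```

A shot whose comments all express one emotion about one concept scores higher than an equally busy shot whose comments are spread over many emotions and topics, which is exactly the behavior the model is after.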

Video Highlight Summarization
Given a set of detected highlight shots H(v) = {h_1, h_2, h_3, ..., h_m} of video v, each with all the lag-calibrated comments C_h of that shot, we attempt to generate summaries A = {a_1, a_2, a_3, ..., a_m} such that a_i ⊂ C_h with compression ratio ρ_summary, and a_i is as close to the ground truth as possible. We propose a simple but very effective summarization model: an improvement over SumBasic (Nenkova & Vanderwende, 2005) with emotion and concept mapping and a two-level updating mechanism.
In the modified SumBasic, instead of only down-sampling the probabilities of words in a selected sentence to prevent redundancy, we down-sample the probabilities of both words and their mapped concepts when re-weighting each comment. This two-level updating mechanism can: (1) penalize sentences whose words are semantically similar to those already selected; (2) still select a sentence containing a word already in the summary if that word occurs much more frequently. In addition, we use an emotion bias parameter b_emotion to weight words and concepts when computing their probabilities, so that the frequencies of emotional words and concepts are increased by b_emotion relative to non-emotional ones.
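The two-level update can be sketched as below. Scoring a comment by the sum of word and concept probabilities, and squaring both levels once per selected comment, are our assumptions about details not fully specified above:

```python
def sumbasic_two_level(comments, word2concept, k, b_emotion=0.3,
                       emotions=frozenset()):
    """Sketch of the modified SumBasic with two-level updating.

    comments: list of token tuples.  Word and concept probabilities are
    both squared after a comment is selected, so comments repeating a
    concept (not just a word) already in the summary are penalized.
    Emotional tokens get their initial weight boosted by b_emotion.
    """
    def weight(tok):
        return 1.0 + (b_emotion if tok in emotions else 0.0)

    total = sum(weight(w) for c in comments for w in c)
    p_word, p_concept = {}, {}
    for c in comments:
        for w in c:
            p_word[w] = p_word.get(w, 0.0) + weight(w) / total
            con = word2concept.get(w, w)
            p_concept[con] = p_concept.get(con, 0.0) + weight(w) / total

    summary, pool = [], list(comments)
    while pool and len(summary) < k:
        best = max(pool, key=lambda c: sum(
            p_word[w] + p_concept[word2concept.get(w, w)] for w in c)
            / max(len(c), 1))
        summary.append(best)
        pool.remove(best)
        for w in set(best):                       # word-level down-sampling
            p_word[w] **= 2
        for con in {word2concept.get(w, w) for w in best}:
            p_concept[con] **= 2                  # concept-level down-sampling
    return summary
```

After one comment about a concept is selected, a near-synonymous comment scores lower than a comment about a fresh concept, even when the two share no surface word.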

Experiment
In this section, we conduct experiments on large real datasets for highlight detection and summarization. We will describe the data collection process, evaluation metrics, benchmarks and experiment results.

Data
In this section, we describe the datasets collected and constructed for our experiments. All datasets and code will be made publicly available on GitHub.

Crowdsourced Time-sync Comment Corpus
To train the word embedding described in 4.1.1, we collected a large corpus of time-sync comments from Bilibili, a content-sharing website in China with time-sync comments. The corpus contains 2,108,746 comments, 15,179,132 tokens and 91,745 unique tokens from 6,368 long videos. Each comment has 7.20 tokens on average.
Before training, each comment is tokenized with the Chinese word tokenization package Jieba. Runs of repeating characters in words such as "233333", "66666" and "哈哈哈哈" are collapsed to two identical characters.
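The repeated-character normalization can be done with a single backreference substitution; this regex is a sketch of the rule described above:

```python
import re

def normalize_repeats(token):
    """Collapse any run of three or more identical characters to two,
    e.g. '233333' -> '233', '66666' -> '66', '哈哈哈哈' -> '哈哈'."""
    return re.sub(r'(.)\1{2,}', r'\1\1', token)
```

This keeps the slang token recognizable ("233" still reads as laughter) while merging arbitrarily elongated variants into one vocabulary entry for the embedding.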
The word embedding is trained using word2vec (Goldberg & Levy, 2014) with the skip-gram model. The embedding dimension is 300, the window size is 7, the down-sampling rate is 1e-3, and words occurring fewer than 3 times are discarded.

Emotion Lexicon Construction
After the word embedding is trained, we manually select emotional words belonging to the five basic categories from the 500 most frequent words in the embedding vocabulary. Then we expand the emotion seeds iteratively as described above. After each expansion iteration, we manually examine the expanded lexicon and remove inaccurate words to prevent concept drift, using the filtered seeds for the next round of expansion. The minimum overlap φ_overlap is set to 0.05, and the minimum similarity sim_min to 0.6; both values are chosen by grid search over the range [0, 1]. The numbers of words for each emotion, initially and after the final expansion, are listed in Table 3.

Video Highlights Data
To evaluate our highlight-detection algorithm, we constructed a ground-truth dataset. It takes advantage of user-uploaded mixed-clips about a specific video on Bilibili. Mixed-clips are collages of video highlights chosen according to the uploader's own preferences. We take the most-voted highlights as the ground truth for a video.
The dataset contains 11 videos, 1,333 minutes in total length, with 75,653 time-sync comments. For each video, 3 to 4 mixed-clips are collected from Bilibili. Shots that occur in at least 2 of the mixed-clips are considered ground-truth highlights. All ground-truth highlights are mapped back to the original video timeline, and the start and end times of each highlight are recorded. The mixed-clips are selected according to the following heuristics: (1) they are found on Bilibili using the keywords "video title + mixed clips"; (2) they are sorted by play count in descending order; (3) a mixed-clip should be mainly about highlights of the video, not a plot-by-plot summary or gist; (4) it should be under 10 minutes; (5) it should contain several highlight shots rather than only one. On average, each video has 24.3 highlight shots. The mean highlight shot length is 27.79 seconds, while the modes are 8 and 10 seconds (frequency = 19).

Highlights Summarization Data
We also construct a highlight-summarization (labeling) dataset for the 11 videos. For each highlight shot with its comments, we ask annotators to construct a summary of these comments by extracting as many comments as they deem necessary.
The rules of thumb are: (1) Comments of the same meaning will not be selected more than once; (2) The most representative comment for similar comments is selected; (3) If a comment stands out on its own, and is irrelevant to the current discussion, it will be discarded.
For the 11 videos with 267 highlights, each highlight has on average 3.83 comments as its summary.

Evaluation Metrics
In this section, we introduce evaluation metrics for highlight-detection and summarization.

Video Highlight Detection Evaluation
For the evaluation of video highlight detection, we need to define what counts as a "hit" between a candidate highlight and a reference. A rigid definition would require a perfect match of beginnings and ends between candidate and reference highlights, which is too harsh for any model. A more tolerant definition would count any overlap between a candidate and a reference highlight, but this still underestimates model performance, since users' choices of where a highlight begins and ends can be quite arbitrary. Instead, we propose a "hit" with relaxation between a candidate h and the reference set R: a candidate h, with start time t_start(h) and end time t_end(h), counts as a hit if its interval overlaps a reference highlight extended by a relaxation length δ on both sides. Precision, recall and the F-1 measure are then defined over hits in the usual way. In the present study, we set the relaxation length δ to 5 seconds, and the length of a candidate highlight to 15 seconds.
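The relaxed-hit criterion and the resulting metrics can be sketched as follows; the symmetric application of δ on both ends of the reference interval is our reading of the relaxation described above:

```python
def is_hit(candidate, reference, delta=5.0):
    """A candidate (start, end) hits a reference highlight if it overlaps
    the reference interval extended by delta seconds on both sides."""
    cs, ce = candidate
    rs, rend = reference
    return cs <= rend + delta and ce >= rs - delta

def evaluate(candidates, references, delta=5.0):
    """Precision / recall / F1 over relaxed hits."""
    hits = [c for c in candidates
            if any(is_hit(c, r, delta) for r in references)]
    covered = [r for r in references
               if any(is_hit(c, r, delta) for c in candidates)]
    precision = len(hits) / len(candidates) if candidates else 0.0
    recall = len(covered) / len(references) if references else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

With δ = 5 seconds, a 15-second candidate that starts 3 seconds before a reference highlight still counts as a hit, absorbing the arbitrariness of user-chosen boundaries.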

Video Highlight Summarization Evaluation
We use ROUGE-1 and ROUGE-2 (C.-Y. Lin, 2004) as the recall of a candidate summary. We use BLEU-1 and BLEU-2 (Papineni, Roukos, Ward, & Zhu, 2002) as precision, for two reasons. First, a naïve precision metric is biased toward shorter comments, which BLEU compensates for with its brevity penalty. Second, while a reference summary contains no redundancy, a candidate summary could falsely select multiple comments that are very similar and match the same keywords in the reference, in which case naïve precision is extremely overestimated. BLEU counts matches one-by-one: the number of matches for a word is the minimum of its frequencies in the candidate and the reference.
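The clipped-count behavior that motivates choosing BLEU can be illustrated at the unigram level (this sketch omits BLEU's brevity penalty and shows only the clipping):

```python
from collections import Counter

def bleu1_precision(candidate_tokens, reference_tokens):
    """Clipped unigram precision: each candidate token is credited at
    most as many times as it appears in the reference."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / max(sum(cand.values()), 1)
```

A candidate that repeats one matching word three times gets credit for only a single match, so redundant near-duplicate comments no longer inflate precision the way a naïve token-overlap metric would.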
Finally, the F-1 measure is defined as the harmonic mean of the BLEU precision and the ROUGE recall.

Benchmarks for Video Highlight Detection
For highlight detection, we compare different combinations of our model with three benchmarks:

• Random-selection. We select highlight shots randomly from all shots of a video.
• Uniform-selection. We select highlight shots at equal intervals.
• Spike-selection. We select the highlight shots with the most comments within the shot.
• Spike+E+T. Our method with emotion and topic concentration but without the lag-calibration step.
• Spike+L. Our method with only the lag-calibration step, without content concentration.
• Spike+L+E+T. Our full model.

Benchmarks for Video Highlight Summarization
For highlight summarization, we compare our method with five benchmarks:

• SumBasic. Summarization that exclusively exploits frequency for summary construction (Nenkova & Vanderwende, 2005).
• Latent Semantic Analysis (LSA). Summarization based on singular value decomposition (SVD) for latent topic discovery (Steinberger & Jezek, 2004).
• LexRank. Graph-based summarization that computes sentence importance via eigenvector centrality in a graph of sentences (Erkan & Radev, 2004).
• KL-Divergence. Summarization that greedily minimizes the KL-divergence between the summary and the source corpus (Haghighi & Vanderwende, 2009).
• Luhn method. Heuristic summarization that considers both word frequency and sentence position in an article (Luhn, 1958).

Experiment Results
In this section, we report experimental results for highlight detection and highlight summarization.

Results of Highlight Detection
In our highlight detection model, the threshold for cutting a lexical chain, l_max, is set to 11 seconds; the concept-mapping overlap threshold φ_overlap is set to 0.5; the number of neighbors for concept mapping, top_n, is set to 15; and the parameter λ controlling the balance between emotion and concept concentration is set to 0.9. A parameter analysis is provided in Section 7.
The precision, recall and F1 measures of different combinations of our method and the benchmarks are compared in Table 4. Our full model (Spike+L+E+T) outperforms all benchmarks on all metrics. Precision and recall for Random-selection and Uniform-selection are low, since they incorporate no structural or content information. Spike-selection improves considerably, since it exploits the comment intensity of a shot. However, not all comment-intensive shots are highlights; for example, comments at the beginning and end of a video are usually high-volume greetings and goodbyes posted as a courtesy. Also, Spike-selection tends to concentrate highlights on consecutive shots with high-volume comments, while our method can jump to other less intensive but emotionally or conceptually concentrated shots. This can be observed in the performance of Spike+E+T.
We also observe that lag-calibration alone (Spike+L) improves on Spike-selection considerably, partially confirming our hypothesis that lag-calibration is important in tasks involving time-sync comments.

Results of Highlight Summarization
In our highlight summarization model, the emotion bias b_emotion is set to 0.3.
The 1-gram BLEU, ROUGE and F1 scores of our method and the benchmarks are compared in Table 5. Our method outperforms all other methods, especially on ROUGE-1. LSA has the lowest BLEU, mainly because it statistically favors long, multi-word sentences, which are not representative among time-sync comments. SumBasic also performs relatively poorly, since it treats semantically related words separately, unlike our method, which operates on concepts instead of words. The 2-gram BLEU, ROUGE and F1 comparisons are in Table 6; our method again outperforms all other methods.
From these results, we believe it is crucial to perform lag-calibration as well as concept and emotion mapping before summarizing time-sync comment text. Lag-calibration shifts prolonged comments back to their original shots, preventing inaccurate highlight detection. Concept and emotion mapping works because time-sync comments are usually very short (7.2 tokens on average), so the meaning of a comment is usually concentrated in one or two "central words". Emotion and concept mapping thus effectively reduce redundancy in the generated summary.

Influence of Shot Length
We analyze the influence of shot length on the F1 score for highlight detection. From the distribution of highlight shot lengths in the gold standard (Figure 2), we observe that most highlight shot lengths lie in the range of 0 to 25 seconds, with 10 seconds as the mode. Therefore, we plot the F1 scores of all four models at shot lengths ranging from 5 to 23 seconds (Figure 3).
From Figure 3 we observe that (1) our method (Spike+L+E+T) consistently outperforms the other benchmarks at varied shot lengths; (2) however, its advantage over the Spike method appears to shrink as the shot length increases. This is reasonable: as shots become longer, the number of comments per shot accumulates, and beyond a certain point a shot with significantly more comments will stand out as a highlight regardless of the emotions and topics it contains. This does not always hold, though. When there are too few comments, detection that relies entirely on volume fails; conversely, when overwhelming volumes of comments are distributed evenly across shots, spikes are no longer a good indicator, since every shot has an equally large volume. Moreover, most highlights in practice are under 15 seconds, and Figure 3 shows that our method detects highlights more accurately at this finer level.

Parameters for Highlight Detection
We analyze the influence of four parameters on the recall of highlight detection: the maximum silence for lexical chains l_max, the concept-mapping threshold φ_overlap, the number of neighbors for concept mapping top_n, and the emotion-concept balance λ (Figure 4).
From Figure 4, we observe the following. (1) For lag-calibration, there seems to be an optimal maximum silence length, 11 seconds for our dataset, as the longest blank continuation of a chain; this value controls the compactness of a lexical chain. (2) In concept mapping, the minimum overlap with existing concepts controls the threshold for concept merging: the higher the threshold, the more similar two merged concepts must be. Recall increases with the overlap up to a certain point (0.5 in our dataset) and does not improve beyond it.
(3) In concept mapping, there also seems to be an optimal number of neighbors to search (15 in our dataset). (4) The balance between emotion and concept concentration (λ) leans heavily toward emotion (0.9 in our dataset).

Parameter for Highlight Summarization
We also analyze the influence of the emotion bias b_emotion on ROUGE-1 and ROUGE-2 for highlight summarization; the results are depicted in Figure 5. From Figure 5, we observe that emotion plays a moderate role in summarization (emotion bias = 0.3), less significant than in the highlight detection task, where emotion concentration is much more important than concept concentration.

Conclusion
In this paper, we propose a novel unsupervised framework for video highlight detection and summarization based on crowdsourced time-sync comments. For highlight detection, we develop a lag-calibration technique that shifts lagged comments back to their original scenes based on concept-mapped lexical chains. Video highlights are then detected by scoring comment intensity and concept-emotion concentration in each shot. For highlight summarization, we propose a two-level SumBasic that updates word and concept probabilities simultaneously in each iterative sentence selection. In the future, we plan to integrate multiple sources of information for highlight detection, such as video metadata, audience profiles, and low-level features of multiple modalities from video processing.