Cold Start Problem For Automated Live Video Comments

Live video comments, or "danmu", are an emerging feature on Asian online video platforms. Danmu are time-synchronous comments that are overlaid on the video playback. They uniquely enrich the viewing experience and engagement of users, and have become a determining factor in the popularity of videos. Similar to the "cold start problem" in recommender systems, a video only starts to attract attention once sufficient danmu comments have been posted on it. We study this video cold start problem and examine how new comments can be generated automatically for less-commented videos. We propose to predict danmu comments by exploiting a multi-modal combination of the video's visual content, subtitles, audio signals, and any surrounding comments (when they exist). Our method fuses these modalities in a transformer network which is then trained for different comment density scenarios. We evaluate our proposed system through both a retrieval-based evaluation method and human judgement. Results show that our proposed system improves significantly over state-of-the-art methods.


Introduction
Live video comments, or "danmu", are an emerging feature of video sharing platforms such as Bilibili and Nicovideo, adopted by hundreds of millions of users in Asia. Danmu is a time-synchronous commentary subtitle system that displays user comments as streams of moving subtitles overlaid on the video playback screen (see Fig. 1). Danmu comments have become a key feature of these video platforms, so much so that videos with many danmu comments stand a higher chance of being recommended or retrieved in search, and naturally attract more viewers.
This new form of media consumption comes with a vast amount of annotated video data and opens the path to multiple new research strands for video technologies, including automated highlighting, summarization and conversational engagement. The main focus of the research literature (see Section 2) has so far been on the automatic generation of danmu comments (Lv et al., 2019; Ma et al., 2019; Weiying et al., 2020). In particular, Ma et al. (2019) recently proposed "Livebot", a new benchmark with a baseline unified transformer architecture to automatically generate new danmu comments from existing danmu comments and video content. This literature has mostly focused on the analysis of videos that already have many comments. This is, however, probably not the most critical scenario for automated danmu generation, as these videos are already popular; it is also easier in these cases to exploit the numerous nearby comments to generate new ones. Similar to the "cold start problem" in recommender systems, the real issue faced by content creators is that videos need many danmu comments to start attracting traffic.

Figure 1: A video frame from bilibili.com with danmu comments overlaid. The lower part of the image shows the danmu comment distribution over the video. The subtitle says: "could you publish some danmu?" and the viewers are responding with a danmu burst.
In this paper we propose to solve this "video cold start problem" with a method that can generate danmu comments on videos which have zero, few, or many comments. We propose a multi-density cold video transformer (MCVT) that can leverage multi-modal signals, including surrounding comments and video frames, but also subtitles and audio signals, in an end-to-end neural network (see Section 4). The key idea is then to approach the task globally and train the network for different comment density scenarios (see Section 5). To achieve this, we collect the publishing timestamps of comments from the video platform and look at the sequence of comment publishing times (see Section 3). This allows us to consider different snapshots of a video's commenting lifetime (i.e., when the video was freshly uploaded with no comments, then when it had a few comments, and later when it had many comments). This information has not been exploited in existing work described in the literature, but we show that it can be used effectively in training for danmu generation. We evaluate our system in Section 6 through both a retrieval-based evaluation method and human judgement. Results show that our system is able to produce comments that are close to the quality of human comments. The key contributions of this paper are as follows:
• We are the first to investigate the cold video problem for the automated creation of danmu, which enables us to create comments for freshly uploaded videos.
• We expand a publicly available danmu video dataset (Ma et al., 2019) by doubling its size and enriching multi-modal features from video embedded subtitles.
• We propose a multi-density cold video transformer (MCVT) architecture and training framework which can generate high quality comments across different comment densities and outperforms the state-of-the-art method.
To make our work fully reproducible, both the source code and the dataset used have been made publicly available. 1

Related Work
In this section we introduce existing work on automated danmu generation, the detection of video highlights based both on manually contributed danmu and on automated analysis of video content, and the automated creation of descriptive captions for videos.

1 https://github.com/fireflyHunter/Cold-Video-Danmu-Generation

Danmu Generation
The earliest work in danmu content generation was based on a generative adversarial model, where the video frames are directly mapped into the textual space of comments (Lv et al., 2019). This method, however, does not exploit existing nearby comments. Ma et al. (2019) proposed LiveBot, which combines both visual and textual contexts in an encoding phase with a Transformer architecture. They also proposed evaluation metrics and released a publicly accessible training set. This work has served as a benchmark for the most recent approaches (Zhang et al., 2020; Chaoqun et al., 2020; Weiying et al., 2020). In previous work, we reworked the baseline implementation of LiveBot to address several shortcomings in both the original dataset and implementation (Wu et al., 2020).
We note that LiveBot, and its successors, are trained on densely commented videos, and use all available comments to make predictions. Thus, they do not consider what will in practice be the more useful setting for automated danmu creation: videos with few or no comments, which we refer to as the cold start scenario. Also, they do not make use of all of the attributes of the comments. In particular, the publishing time of the comments is not included in the training set. This means that the causality between comments is lost and that the target comments could potentially predate the proposed contextual comments. These methods also do not consider where to publish in the video timeline.

Highlight Detection
Since video highlights could provide pointers for comment generation, some prior work has tried to predict popular segments in videos. Video highlights, as they are called, can be identified by looking at the current distribution of published danmu comments (see the plot in Fig. 1). This is the idea exploited in (Xu et al., 2017), where a personalised frame-level recommendation is based on the analysis of published comments. More relevant to the cold start problem is highlight prediction solely from video content, as proposed in (Zheng et al., 2020) using a bi-directional Long Short-Term Memory (LSTM) architecture.

Video Captioning
Related to our application is the task of video captioning, which aims to generate descriptive sentences for a video sequence. Current architectures for this usually follow an encoder-decoder pattern.
In the encoder, the sequence of video frames is embedded by a CNN (Subhashini et al., 2014) or RNN (Nitish et al., 2015). The decoder, typically an LSTM, generates captions from the contextual output of the encoder. Techniques like reinforcement learning (Xin et al., 2018), contextual-aware video captioning (Spencer et al., 2018) and semantic attention model (Gan et al., 2017) have also been explored by researchers in this field. What emerges from the recent literature is that the Transformer architecture, as proposed in Livebot, has become the state-of-the-art approach for multi-modal text generation applications and thus we adopt this as the baseline for our application.

Task Overview
In this section we define our danmu creation task, introduce the dataset used in our work and outline the video content extraction methods used in this investigation.

Task Definition
To address the cold start problem we aim to generate high quality comments for videos with different comment densities. In order to handle the different danmu density scenarios of the cold start problem, we first sort the existing comments C for a video by their publication time and only keep a subset C_p consisting of the percentage p of earliest comments of the video. This strategy is enforced to reconstruct video danmu comments at different phases of their lifetime. We then define our task as follows: given a video V = {s_0, ..., s_L} (following accepted convention, V is split into segments of one second duration), the generation module is asked to generate a target comment y using comments from C_p and the k previous seconds of the video clip s_[i−k,i].
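The construction of C_p can be sketched as follows. This is a minimal illustration of our reading of the definition; the function name and tuple layout are our own, not taken from the released code.

```python
def build_context_set(comments, p):
    """Keep the earliest fraction p of a video's comments, simulating a
    snapshot of the video's commenting lifetime (illustrative sketch).

    comments: list of (publish_time, video_time, text) tuples.
    """
    # Sort by publication time, so C_p contains the p earliest comments.
    ordered = sorted(comments, key=lambda c: c[0])
    return ordered[:int(len(ordered) * p)]
```

For example, `p = 0.0` reconstructs the complete cold start scenario (no comments), while `p = 1.0` reproduces the densely commented setting used by Livebot.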

Dataset
For our investigation, we constructed a large-scale dataset with 4,672 videos and 2,789,360 danmu comments, which is publicly available 2 . Part of the data (2,322 videos and 857,993 comments) comes from the publicly available Response to Livebot dataset for automatic danmu generation (Wu et al., 2020). As our task aims to generate comments for videos with low comment densities, the size of the suitable training data is reduced significantly, compared to general comment creation, during the reconstruction of the cold start scenarios. We thus added another 2,350 videos from the same danmu video website (bilibili.com) to the dataset. The Livebot dataset is mainly themed around natural life; to keep the dataset consistent, the appended videos were selected by having a web crawler pick the 100 most popular "Daily Life" category videos of the previous three days, every day, for two months (see Fig. 2).
A key contribution of our paper is that we take into account the publication timestamp of each of the danmu comments. The training data for a particular level p, the percentage of existing manual comments preserved, is defined as follows. Each target comment for the training set is randomly sampled from the original comment set C, and the corresponding comment's context is defined as the 5 nearest comments from C_p that precede the target danmu in the video timeline. This follows the observation made in Livebot (Ma et al., 2019) that the semantic and textual similarity of comments is correlated with their timeline proximity and that the danmu context should be limited to the 5 nearest comments. We also add a causality constraint: context comments must have been published before the target danmu in natural time.
We sample the training data for p = 0%, 5%, 30%, 50%, 70% and 100%, to form a combined training set of 4,800,145 pairs of target comment/context comments. Target comments can be sampled multiple times for different contexts.
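The context-selection rule above, with both the timeline-proximity and the causality constraint, can be sketched as follows (an illustrative sketch with our own names; the released code may differ in detail):

```python
def sample_context(target, c_p, k_ctx=5):
    """Select up to k_ctx context comments for a target danmu.

    target: (publish_time, video_time, text) of the sampled target comment.
    c_p:    the preserved earliest-comment subset C_p, same tuple layout.
    """
    t_pub, t_vid, _ = target
    # Causality constraint: context must be published before the target
    # in natural time, and precede it in the video timeline.
    eligible = [c for c in c_p if c[0] < t_pub and c[1] <= t_vid]
    # Keep the comments nearest to the target along the video timeline.
    eligible.sort(key=lambda c: t_vid - c[1])
    return eligible[:k_ctx]
```

In the cold start case (C_p empty) this simply returns an empty context, which the text encoder handles with a special no-context token.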
For the 200 videos of the test set, we focus on the video highlights by only selecting the 1,879 comments in the most frequently commented moments of the video timeline. To study system performance under different comment densities, we build one test set for each of the proposed values of p.

Video Information Extraction
We further augment the complete danmu commenting dataset multi-modally by extracting the audio and the subtitle information in addition to the visual and textual comment information. We believe that these additional features will help with the cold start problem.
Visual & Audio Signals. We follow standard practice by sampling one video frame per second of video. The frame from the i-th second of the video is denoted as f_i. The audio soundtrack is extracted from the video and uniformly resampled to 16 kHz.
Subtitles. We observe that human created danmu comments frequently respond to speech in the video. Fig. 1 shows an example of this: viewers are asked in the subtitles to post danmu comments. This motivates us to transcribe the speech from the videos. Instead of using speech recognition, we opt to use optical character recognition (OCR): we found that the quality of transcripts produced by speech recognition tools was comparatively poor, while most of the videos on the platform embed speech subtitles that OCR tools can accurately identify. Moreover, captions also display non-speech information which could be exploited. For OCR, we use the open-source Tesseract (Kay, 2007) OCR engine on the lower half of the sampled video frames.
Note that only 109 out of the 4,672 videos contained no recognisable text, and each video contains an average of 13.97 unique subtitles (see Fig. 3).

Network Architecture
Our proposed model, presented in Fig. 4, applies standard Transformer modules in an encoder-decoder architecture. During the encoding stage, visual, audio and text features are first encoded respectively; then three transformer modules are used to fuse the information of the three modalities recursively. In the decoder, the target comment is decoded through a transformer layer with multiple multi-head attention modules that attend to the three encoded multi-modal representations respectively.

Video Encoder
As in (Ma et al., 2019), video frames are encoded through a pre-trained 18-layer ResNet. We take the output of the last pooling layer of ResNet as the visual feature; the frame vector of the i-th second of the video is denoted as v_i ∈ R^{n_18}, where n_18 = 512 is the size of the resulting ResNet18 features. The frame vectors in the video clip are combined as v̂_i = {v_{i−k}, ..., v_i}.

Audio Encoder
For the audio signal, we use 20-dimensional mel-frequency cepstral coefficients (MFCCs) and their 20-dimensional derivatives as audio frame features (Di Gangi et al., 2019). These are extracted with a Hanning window of 40 ms length and a 32 ms hop size. We include all audio frames as the audio input, hence we sample 32 audio vectors for each second of the audio. The audio information at time point i is denoted as a_i^j, where j indexes the audio frame vector in the analysis window at time i. A GRU module (Chaoqun et al., 2020) is applied to recursively encode the input audio sequence. At each stage, the current hidden state h_i^j is calculated from the last hidden state h_i^{j−1} and the current input audio frame vector a_i^j. The sequence of hidden states h_i^j for all audio frames is concatenated into an audio encoder output â_i ∈ R^{n_a×512}, where n_a = 32 × k is the number of audio frames in the analysis window and 512 is the dimension of the hidden state.
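The frame count per second follows directly from the window and hop parameters. The sketch below (our own helper, not from the paper's code) computes the unpadded count; note that without edge padding a one-second clip yields 31 frames, and common MFCC libraries that pad the signal edges round this up to the 32 frames per second used here.

```python
def num_audio_frames(duration_s, sr=16000, win_ms=40, hop_ms=32):
    """Number of MFCC analysis frames for a signal of duration_s seconds,
    without edge padding (illustrative; exact counts depend on the
    padding convention of the feature-extraction library)."""
    win = sr * win_ms // 1000   # 640 samples at 16 kHz
    hop = sr * hop_ms // 1000   # 512 samples at 16 kHz
    return (duration_s * sr - win) // hop + 1
```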

Text Encoder
Contextual comments are concatenated with a special delimiter token T_d between each comment and then combined with the unique subtitles from the analysed k-second window. As opposed to Livebot (Ma et al., 2019), where there are always 5 context comments, in our cold start scenario we sometimes have fewer than 5, or even 0, comments. In the extreme case we use a special token T_n with an empty comment field to indicate that no context comments are available.
All unique subtitles within the analysis window s_[i−k,i] are also concatenated with the same delimiter token. Finally, we form the text input by combining the comment sequence and the subtitle sequence with T_d.
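The assembly of the text input can be sketched as follows; the literal token strings are placeholders of our own choosing for T_d and T_n:

```python
T_D = "<d>"    # delimiter token T_d (placeholder string)
T_N = "<noc>"  # no-context token T_n (placeholder string)

def build_text_input(context_comments, subtitles):
    """Join context comments and unique subtitles with the delimiter
    token, falling back to the no-context token when no context
    comments are available (illustrative sketch)."""
    comment_seq = T_D.join(context_comments) if context_comments else T_N
    # dict.fromkeys keeps only unique subtitles while preserving order
    subtitle_seq = T_D.join(dict.fromkeys(subtitles))
    return comment_seq + T_D + subtitle_seq
```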
We remove punctuation and segment words using Jieba (an open-source Chinese text segmentation tool). Each word of the text input is then passed to an embedding layer of size d × |V|, where d is the dimension of the word embedding and |V| is the size of the vocabulary. After embedding, the text input for the analysis window s_[i−k,i] is represented as ê_i ∈ R^{n×d}.

Fusion of Modalities
Following the success of the Transformer architecture in multi-modal processing (Ma et al., 2019; Chaoqun et al., 2020), we adopt a multi-unit Transformer module to recursively learn and combine representations from all three modalities. The first Transformer unit encodes the text input ê_i into a transitional hidden state H_e. Then, a second transformer unit combines H_e and the input audio with two multi-head attention modules, the first one attending to â_i and the second one attending to H_e. Finally, another unit with three multi-head attention modules is used to summarise the video clip representation H_vae.

Decoder
In the model decoder, the output comment is generated through a transformer layer with 4 multi-head attention modules that attend to the target comment y, the text hidden state H_e, the audio-fused hidden state H_ae and the full multi-modal hidden state H_vae respectively. The probability of the output comment is then produced with a softmax layer on top of the decoder output.

Multi-Density Learning
A key aspect of our method is to consider all the different cold start scenarios together by adopting a multi-task training strategy.
In detail, our training regime is implemented by randomly assigning, for each minibatch, the percentage p of earlier comments that are kept, from a fixed set of values {0%, 5%, 30%, 50%, 70%, 100%}. Recall that p = 0% corresponds to the cold start problem, and p = 100% corresponds to the situation where all other comments are available (as in Livebot (Ma et al., 2019)). By alternating between these values of p, we are able to train the network for both the cold start and the Livebot scenarios.
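This per-minibatch assignment can be sketched as follows (our own reading of the regime; function names and the seeding are illustrative, not from the released code):

```python
import random

# Density levels covering cold start (0%) through fully commented (100%).
P_LEVELS = [0.0, 0.05, 0.30, 0.50, 0.70, 1.00]

def assign_densities(num_batches, seed=0):
    """Assign a randomly chosen comment-density level p to each
    minibatch, so training alternates between cold start and
    densely-commented scenarios (illustrative sketch)."""
    rng = random.Random(seed)
    return [rng.choice(P_LEVELS) for _ in range(num_batches)]
```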

Training Detail
The video analysis window size k is set to 5 seconds. For the text input, we build the vocabulary from the 50,000 most frequent words in the dataset and set the maximum length of the input text sequence to 50. In the model, the text embedding is of size 512 and is randomly initialized before training. The dimension of the audio GRU hidden state is set to 512. We apply the same settings to all transformer components used in the network. For each transformer, the hidden state dimension is set to 512, the feed-forward network dimension is 2048, the number of heads is 8 and the number of blocks is 6. The loss criterion is cross-entropy. The number of epochs is set to 10, the batch size to 64, and we use the Adam optimizer (Kingma and Ba, 2014) with settings β1 = 0.9, β2 = 0.998, weight decay = 1×10^−4, ε = 1×10^−8 and learning rate 1×10^−4. All training was done on a Linux server with a single RTX 2080 Ti graphics card, a 16-core Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz and 256GB RAM. The model is implemented using PyTorch 1.4.0 and Python 3.6. With the above settings, it takes around 34 hours to complete training.

Experiments
In this section we report results for our investigation of comment generation. We use the Livebot model (Ma et al., 2019) as a baseline. Specifically, we use the code from (Wu et al., 2020), trained on our full dataset with only video frames and surrounding comments as input. The models proposed in (Chaoqun et al., 2020; Zhang et al., 2020) are very recent and their code is not publicly available yet, so we do not consider them as baseline methods. Other, older neural architectures such as LSTMs are also not included in this study, since it is well established that Transformers are the method of choice for modelling multi-modal signals.

Evaluation
We note that reference-based metrics for generation tasks such as BLEU and ROUGE are not suitable for the evaluation of video comments (Das et al., 2017; Ma et al., 2019; Zhang et al., 2020). Hence we follow (Das et al., 2017) and focus on the ability to rank the correct comment originally appearing at a given point in the video above other comments taken from the dataset. We evaluate our system through a retrieval-based protocol: the model is asked to re-rank a candidate set for each test sample. The candidate set for re-ranking comprises 100 comments: the 5 correct ground-truth comments for this point in the video, the 20 comments most similar to the title of the video based on tf-idf score (plausible candidates), the 20 most frequent comments in the dataset, and 55 randomly sampled comments.
We report Recall@k, Precision@k, Mean Rank (MR) and Mean Reciprocal Rank (MRR) as evaluation metrics on this retrieval task. A confidence interval is reported for each of these metrics at a 95% confidence level (for R@k, we use the confidence interval for population proportions).
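For one test sample these per-sample quantities can be computed as below (a minimal sketch with our own function name; it assumes, as in our protocol, that the correct comments always appear in the candidate set). MR and MRR are then the averages of the rank and reciprocal rank over all test samples.

```python
def retrieval_metrics(ranked_ids, correct_ids, k=5):
    """Recall@k, Precision@k and reciprocal rank for a single test
    sample, given candidate ids ranked by model score (illustrative)."""
    correct = set(correct_ids)
    top_k = ranked_ids[:k]
    hits = sum(1 for c in top_k if c in correct)
    recall_k = hits / len(correct)
    precision_k = hits / k
    # 1-based rank of the first correct comment in the ranking.
    first_rank = next(i for i, c in enumerate(ranked_ids, 1) if c in correct)
    return recall_k, precision_k, 1.0 / first_rank
```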

Ablation Study
The retrieval task results are reported in Table 3 and Figure 5. In this ablation study, we compare 4 variants of the model.
• Livebot (Ma et al., 2019) leverages textual and visual information in a Transformer architecture. It is trained on the extended dataset using the implementation provided in (Wu et al., 2020). The training is done here with p=100%.
• Livebot-t applies the same network architecture as Livebot, but is trained with our multi-density training strategy to evaluate the effectiveness of our proposed training regime.
• MCVT is the final system proposed in this work, which includes the training regime and the inclusion of the additional audio and subtitle features.
• MCVT-Zero is listed to further examine the performance limit in the cold start scenario, i.e. we assume a situation where no comments are present. Thus, we train the MCVT network solely on the cold start scenario with p = 0%.
The results in Table 3 show that Livebot-t outperforms the baseline Livebot model in most cases, which demonstrates the effectiveness of our training strategy. One exception is found when p = 100%: the Livebot model, trained only on densely commented videos, slightly outscores Livebot-t. We think this means the information learned from the multi-density training strategy produces extra noise when the model only aims to generate comments for popular videos. By contrast, from the third and fourth rows of Table 3, we can see that our MCVT model has similar performance to MCVT-Zero, which has been trained specifically for the complete cold start scenario.
In this situation, the extra knowledge gained from learning popular videos does not appear to affect the performance in the cold start situation. This comparison between the behaviour of the Livebot and MCVT systems potentially demonstrates the advantage of our training regime in the case of cold start scenario.
We also see that our model outperforms Livebot-t in every scenario, which also supports the idea that integrating the audio signal and subtitles in the generation system can significantly improve the performance of the model.

Human Evaluation
Additionally, we also use human judgements to obtain a more intuitive and reliable measurement of the generated comments. A subset of 50 videos was randomly sampled from the 200 videos of the test set. Three native Chinese speakers familiar with danmu were asked to rate the quality of the generated comments on three criteria: fluency, relevancy and engagement.
• Fluency is intended to measure the language quality of the generated comment.
• Relevancy measures the semantic relevancy between the generated comment and the input video and nearby comments.
• Engagement should reflect how likely it is that the generated comment will motivate others to respond.

Table 2: Human evaluation on 50 videos from the test set. Each comment is graded between 1 and 5, by 3 reviewers, for language fluency, relevance to the video content, and how likely it is to provoke other viewers to also comment.
The scores for all 3 measurements range from 1 (poor) to 5 (excellent). The final score is the average of the scores of the three annotators. The evaluation was conducted on the comments generated by our method for p ∈ {0%, 5%, 50%}. For reference, we also evaluated the ground-truth comment set for these videos. Table 2 reports the results of this human evaluation. We can see that the overall performance of the model is almost indistinguishable from real danmu comments. Our relevancy and engagement scores are actually higher when p ≥ 50%. The quality of our model's output degrades slightly in the complete cold start scenario, but the results are still quite close to human comments.

Case Study
Examples of predicted outputs are shown in Fig. 6. The corresponding video frame shows a groundhog being fed. The subtitle, context comment, generated comments and target comments are reported in the table to the right. We can see that the model generates reasonable comments, which are relevant to the video shot and match the video's positive emotion (e.g. "laugh", "hahaha" and "lol"), even in the case of a complete cold start.

Conclusions and Further Development
In this paper we investigate the cold video start problem in automated danmu comment generation. We propose a multi-modal fusion network which includes processing of video frames, already published comments, and also audio and caption text. We train it for different comment density scenarios and perform extensive experiments on an expanded danmu video dataset. Results demonstrate the advantage of our method over the state-of-the-art in solving the cold video start problem.

Our next research goal is to leverage a highlight detection method in this task to further improve system performance, since this is expected to reveal areas of likely user interest on the video timeline, which could provide pointers for preferred locations for the automated creation of danmu comments.

Figure 6: Example output for the groundhog video. Generated comment: "Brainwashed by groundhog hahahahahaha". Subtitle: "The guy and his friend laughed and passed the biscuit to the groundhog's mouth."