DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization

In this paper, we focus on natural language video localization: localizing (i.e., grounding) a natural language description in a long and untrimmed video sequence. All currently published models for addressing this problem can be categorized into two types: (i) the top-down approach, which performs classification and regression for a set of pre-cut video segment candidates; (ii) the bottom-up approach, which directly predicts the probability of each video frame being a temporal boundary (i.e., the start or end time point). However, both approaches suffer from limitations: the former is computation-intensive because of the densely placed candidates, while the latter has so far trailed the performance of its top-down counterpart. To this end, we propose a novel dense bottom-up framework: DEnse Bottom-Up Grounding (DEBUG). DEBUG regards all frames falling in the ground truth segment as foreground, and each foreground frame regresses the unique distances from its location to the bi-directional ground truth boundaries. Extensive experiments on three challenging benchmarks (TACoS, Charades-STA, and ActivityNet Captions) show that DEBUG is able to match the speed of bottom-up models while surpassing the performance of the state-of-the-art top-down models.


Introduction
Vision-and-language understanding, e.g., what the vision and text are, and how they relate to each other, is one of the core tasks in both computer vision and natural language processing. To test machine comprehension of complex video scenes and natural language simultaneously, a challenging task called Natural Language Video Localization (NLVL) was proposed (Gao et al., 2017; Hendricks et al., 2017, 2018). As shown in Figure 1, given a natural language description query and an untrimmed video sequence, NLVL needs to localize a segment in the video (i.e., identify the start and end time points of the segment) which semantically corresponds to the reference sentence. Moreover, NLVL is an indispensable technique for many important applications, e.g., text-oriented video highlight detection or retrieval.
* Chujie Lu and Long Chen are co-first authors with equal contributions. † Jun Xiao is the corresponding author.
Currently, the overwhelming majority of NLVL models are top-down approaches: they first cut a video into a set of segment candidates, and then perform classification and regression for each candidate. Specifically, they can be further grouped into two sub-types: 1) sliding-window-based (Gao et al., 2017; Hendricks et al., 2017; Liu et al., 2018b,a; Ge et al., 2019; Chen and Jiang, 2019; Xu et al., 2019): the video is explicitly segmented by multiple predefined temporal sliding-window scales. After extracting features for the query and all candidates, NLVL degrades into a multimodal matching problem. 2) anchor-based (Zhang et al., 2019a): each frame is assigned multi-scale temporal anchors, in the same spirit as the anchor boxes in object detection (Ren et al., 2015).
Although top-down approaches have dominated NLVL for years, it is worth noting that they suffer from several notorious limitations: 1) The performance is sensitive to the heuristic rules (e.g., the temporal scales or the number of candidates). 2) In order to achieve a high recall, they are required to place candidates densely, which significantly increases the amount of computation and the localization time.
To eliminate these inherent drawbacks of the top-down framework, some recent NLVL works (Chen et al., 2019a; Yuan et al., 2019) have started to borrow ideas from reading comprehension (Xiong et al., 2017, 2018; Yu et al., 2018) and directly predict the start and end boundaries. Although this sparse bottom-up approach is highly computation-efficient, its localization accuracy, especially for long videos (e.g., TACoS), is behind that of its top-down counterpart. We argue that the main reasons are threefold: 1) The two boundary predictions are independent, i.e., the model ignores the content consistency between the two predictions. As the example in Figure 2 (a) shows, the frames at B and D have a similar visual appearance. Thus, the model is prone to predicting the result as (A→D), without considering the distinct content change in the range (B→C). 2) The positive and negative training samples are extremely imbalanced: the number of video frames is large (e.g., each video in TACoS has 9,000 frames on average), but the positive training samples are sparse, i.e., only two frames (Figure 2 (b)). 3) Detecting temporal action boundaries from frames, i.e., predicting with a single network that a frame is both query-related and at a temporal boundary, is still a challenging task, even without the query constraint (Shou et al., 2018).
In this paper, we propose a dense bottom-up framework for NLVL: DEnse Bottom-Up Grounding (DEBUG), to mitigate the problems in existing NLVL frameworks. Specifically, we regard all frames falling in the ground truth segment as positive samples (i.e., foreground). For each positive frame, DEBUG has a classification subnet to predict its relatedness with the query, and a boundary regression subnet to regress the unique distances from its location to the bi-directional ground truth boundaries. This design helps to disentangle temporal boundary detection from query-related prediction, relieving the burden on the single classification network in existing sparse bottom-up models. Meanwhile, we can utilize as many positive samples as possible to alleviate the imbalance between positive and negative samples (Figure 2 (c)). Since each pair of boundary predictions is based on the same frame feature, the two predictions act as a whole, which helps to avoid falling into the local optimum caused by independent predictions. In addition, we propose temporal pooling to relieve the unstable performance caused by relying on a single prediction. Moreover, DEBUG is agnostic to the upstream multimodal interaction network, i.e., it can be seamlessly incorporated into any stronger backbone to boost performance.
We demonstrate the effectiveness of DEBUG on three challenging benchmarks: TACoS (Regneri et al., 2013), Charades-STA (Gao et al., 2017), and ActivityNet Captions (Krishna et al., 2017). Without bells and whistles, DEBUG surpasses the performance of the state-of-the-art models over various benchmarks and metrics at the highest speed.
Related Work

Natural Language Video Localization
NLVL is a very difficult task, which requires understanding both complex video scenes and natural language simultaneously. Because most NLVL models are under the top-down framework, they focus on designing more effective multimodal interaction networks, e.g., query-based attention on video frames (Liu et al., 2018a), visual-based attention on language words (Liu et al., 2018b), or co-attention between each frame-and-word pair (Chen et al., 2019a; Yuan et al., 2019). It is worth noting that improvements in the multimodal interaction network are orthogonal to DEBUG, i.e., DEBUG can be seamlessly incorporated into any stronger interaction network.
To the best of our knowledge, there are only two exceptions among all NLVL models: RWM and SM-RL (Wang et al., 2019), which fall under neither the top-down nor the bottom-up framework. They both formulate NLVL as a sequential decision-making problem, solved by reinforcement learning, e.g., actor-critic (Chen et al., 2019b). The action space for each step is a set of hand-designed temporal box transformations.

Top-Down vs. Bottom-Up
Top-down and bottom-up approaches, which widely co-exist in many CV and NLP tasks, are two different philosophical viewpoints for solving problems. The top-down and bottom-up concepts most closely related to those in NLVL frameworks are:
Object Detection. Most of the object detectors after Faster R-CNN (Ren et al., 2015) are top-down models, i.e., they predict classification scores and regression offsets for multiple predefined anchors at each position. These models suffer from the same drawbacks as those mentioned above for the top-down approach in NLVL. However, with the advent of the first bottom-up object detector of comparable performance, CornerNet (Law and Deng, 2018), bottom-up approaches have begun to gain unprecedented attention (Zhou et al., 2019b,a; Duan et al., 2019), which inspires us to explore a decent bottom-up framework for NLVL.
Attention Mechanism. Top-down attention has dominated many vision-and-language tasks, e.g., visual captioning (Xu et al., 2015) and visual QA (Xu and Saenko, 2016; Ye et al., 2017). Recently, a model combining both top-down and bottom-up attention won multiple challenges (Anderson et al., 2018). Thus, how to combine top-down and bottom-up attention effectively is still an unexplored problem in vision-and-language tasks.
Figure 3 (b): The encoder block throughout the QANet.

Approach
The NLVL task considered in this paper is defined as follows. Given a long and untrimmed video sequence $V$, and a natural language query $Q$ which describes a segment in $V$ from start time point $t_s$ to end time point $t_e$, NLVL needs to predict these two time points $(t_s, t_e)$ given $V$ and $Q$.
In this section, we first introduce the multimodal interaction backbone of DEBUG, which is built upon the recently proposed QANet (Yu et al., 2018) for reading comprehension (Section 3.1). Then, we describe the details of the proposed dense bottom-up grounding (Section 3.2). Finally, we describe the training and test stages of the whole DEBUG (Section 3.3).

Backbone: QANet
We adopt QANet to model the interaction between the two modalities (i.e., video and language); it serves as the backbone of DEBUG. The details of QANet are shown in Figure 3 (a). It consists of three main components:
Embedding Encoder Layer. The inputs to the two branches of this layer are the extracted video frame features $F = \{f_i\}_{i=1}^{T}$ and the query word features $W = \{w_n\}_{n=1}^{N}$, respectively (see details in Section 4.2), where $T$ and $N$ are the numbers of frames and words. The embedding encoder layer is a stack of encoder blocks, as shown in Figure 3 (b); each block contains multiple components, including a convolutional layer, a layer-normalization layer, a self-attention layer, and a feedforward layer. The output of this layer is the new frame features $\tilde{F} = \{\tilde{f}_i\}_{i=1}^{T}$ or word features $\tilde{W} = \{\tilde{w}_n\}_{n=1}^{N}$, which encode the context within their respective modality.
Visual-language Attention Layer. It calculates two different attention weights between the two modal features. Specifically, it first computes a similarity matrix $S \in \mathbb{R}^{T \times N}$, where $S_{ij}$ denotes the similarity between frame feature $\tilde{f}_i$ and word feature $\tilde{w}_j$. Then, following QANet, the two attention weights are:
$A = \bar{S} \cdot \tilde{W}, \quad B = \bar{S} \cdot \bar{\bar{S}}^{\top} \cdot \tilde{F},$
where $\bar{S}$ and $\bar{\bar{S}}$ are the row-wise and column-wise normalized matrices of $S$, respectively.
Model Encoder Layer. Given the two attention weights $A$ and $B$, the model encoder layer encodes the interaction between the two modal features. This layer is also a stack of encoder blocks (Figure 3 (b)), and these encoder blocks share parameters. Following QANet, the input at the $i$-th position is $[\tilde{f}_i, a_i, \tilde{f}_i \odot a_i, \tilde{f}_i \odot b_i]$, where $a_i$ and $b_i$ are the $i$-th rows of $A$ and $B$ and $\odot$ is element-wise multiplication. The output of this layer is $H = \{h_i\}_{i=1}^{T}$, where $h_i$ is a frame feature encoded with multimodal context. Then $H$ is fed into the following head network (i.e., dense bottom-up grounding) for boundary prediction. We refer readers to the QANet paper (Yu et al., 2018) for more details.
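The attention step can be made concrete with a small sketch. It assumes the QANet-style formulation, in which the first attention weight is the row-normalized similarity matrix applied to the word features and the second chains the row-normalized and transposed column-normalized matrices before applying them to the frame features. The dot-product similarity, the softmax normalization, and all function names are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def visual_language_attention(frames, words):
    """frames: T x D, words: N x D. Returns (A, B), each of shape T x D."""
    # Similarity between every frame and every word (dot product is an assumption).
    sim = matmul(frames, transpose(words))                      # T x N
    # Row-wise normalization: each frame distributes weight over the words.
    sim_row = [softmax(row) for row in sim]                     # T x N
    # Column-wise normalization: each word distributes weight over the frames.
    sim_col = transpose([softmax(col) for col in transpose(sim)])  # T x N
    attn_a = matmul(sim_row, words)                             # T x D
    attn_b = matmul(matmul(sim_row, transpose(sim_col)), frames)  # T x D
    return attn_a, attn_b


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # T = 3 frames, D = 2
words = [[1.0, 0.0], [0.0, 2.0]]               # N = 2 words
A, B = visual_language_attention(frames, words)
```

Each row of `A` is a convex combination of word features and each row of `B` is a convex combination of frame features, so both keep the per-frame layout that the model encoder layer consumes.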

Dense Bottom-Up Grounding
Owing to the nature of the bottom-up approach, DEBUG regards each frame as a training sample. Unlike the existing sparse bottom-up models, which use only the exact start and end boundary frames as foreground, DEBUG utilizes all frames falling in the ground truth segment as positive samples. For each sample, there are three branch subnets, which predict its classification score, boundary distances, and confidence score, respectively. The whole architecture of DEBUG is shown in Figure 4, and the details of the three branch subnets are as follows:
Classification Subnet. The classification subnet predicts the relatedness between each video frame and the language query, i.e., whether the frame is a foreground frame. Taking the multimodal feature $H \in \mathbb{R}^{T \times D}$ from the backbone, this subnet applies four 1×3 conv layers, each with $D$ filters and each followed by a ReLU activation, followed by a 1×3 conv layer with 1 filter. Finally, a sigmoid activation is attached to output the foreground prediction score per location. For a positive sample (i.e., foreground), the ground truth classification label is $c^*_i = 1$; otherwise $c^*_i = 0$.
Boundary Regression Subnet. The boundary regression subnet predicts the unique distances from the location of each frame to the bi-directional ground truth boundaries. The design of the boundary regression subnet is identical to that of the classification subnet, except that it terminates in 2 outputs, for the left and right distances. We only assign boundary regression targets to positive frames. Specifically, for a positive frame at the $i$-th position, if the ground truth segment range is $(t_s, t_e)$ (i.e., $t_s \le i \le t_e$), the regression target is $t^*_i = (l^*_i, r^*_i)$ with $l^*_i = i - t_s$ and $r^*_i = t_e - i$, where $l^*_i$ and $r^*_i$ represent the distances from the $i$-th frame to the left and right boundaries, respectively.
Confidence Subnet. The design of the confidence subnet is identical to that of the classification subnet, but it predicts the confidence of the boundary regression results for each frame. The motivation for this subnet is that the prediction confidences of different frames should be different, e.g., it is more difficult for a frame near the start point to detect the end point than for a frame near the end point. Therefore, we set the ground truth confidence of each frame based on its "centerness" in the segment. Given the regression targets $l^*_i$ and $r^*_i$, the ground truth confidence score is defined as:
$e^*_i = \frac{\min(l^*_i, r^*_i)}{\max(l^*_i, r^*_i)}.$
The confidence score ranges from 1 to 0 as the frame position moves from the segment center to the boundary.
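The three per-frame training targets can be illustrated with a short sketch. It takes the distances to be the frame index minus the start index and the end index minus the frame index, and it assumes the centerness target is the ratio of the smaller distance to the larger one, which is consistent with the stated range of 1 at the segment center and 0 at the boundaries. The function name and the handling of a single-frame segment are hypothetical.

```python
def dense_targets(num_frames, t_s, t_e):
    """Return one (label, distances, confidence) tuple per frame index.

    label:      1 for frames inside [t_s, t_e], otherwise 0
    distances:  (left, right) distances to the boundaries, positives only
    confidence: centerness in [0, 1], positives only
    """
    targets = []
    for i in range(num_frames):
        if t_s <= i <= t_e:
            left, right = i - t_s, t_e - i
            longest = max(left, right)
            # A single-frame segment has no off-center frames; treating it
            # as fully centered is an assumption of this sketch.
            confidence = min(left, right) / longest if longest > 0 else 1.0
            targets.append((1, (left, right), confidence))
        else:
            # Negative frames get no regression or confidence target.
            targets.append((0, None, None))
    return targets


targets = dense_targets(num_frames=10, t_s=2, t_e=6)
num_positive = sum(label for label, _, _ in targets)
```

With a ground truth segment spanning frames 2 to 6, five frames become positive samples, whereas a sparse bottom-up model would use only the two boundary frames.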

Training and Inference
Loss. Given the predictions for all frames $\{(\hat{c}_i, \hat{t}_i, \hat{e}_i)\}$ and the corresponding ground truth $\{(c^*_i, t^*_i, e^*_i)\}$, the total training loss function for DEBUG is:
$L = \frac{1}{N} \sum_i L_{cls}(\hat{c}_i, c^*_i) + \frac{\alpha}{N_p} \sum_i \mathbb{1}_{\{c^*_i = 1\}} L_{reg}(\hat{t}_i, t^*_i) + \frac{\beta}{N_p} \sum_i \mathbb{1}_{\{c^*_i = 1\}} L_{conf}(\hat{e}_i, e^*_i),$
where $L_{cls}$ and $L_{conf}$ are both binary cross entropy (BCE) losses, for the classification subnet and the confidence subnet, respectively. $L_{reg}$ is the IoU loss for the boundary regression subnet (i.e., $-\ln \mathrm{IoU}(\hat{t}_i, t^*_i)$). $N$ and $N_p$ denote the number of total samples and positive samples, respectively. $\alpha$ and $\beta$ are loss weights that balance the different losses; we set both $\alpha$ and $\beta$ to 1 in all experiments. $\mathbb{1}_{\{c^*_i = 1\}}$ is an indicator function, being 1 if $c^*_i = 1$ and 0 otherwise.
Inference. Given a video and a language query, we forward them through the network and obtain $\hat{c}_i$, $\hat{t}_i$, $\hat{e}_i$ for each frame from the three subnets. Then, we rank all segment predictions by the score $\hat{s}_i = \hat{c}_i \times \hat{e}_i$. A straightforward solution is to select the segment with the highest score as the final prediction. However, a segment prediction from a single frame usually has high variance. To relieve this, we propose a simple yet effective strategy, Temporal Pooling, to fuse multiple frame predictions. As shown in Figure 5, temporal pooling directly uses the leftmost and rightmost boundaries among all pooling candidates as its output. The pooling candidates need to meet two conditions simultaneously: 1) the predicted segment overlaps with the one with the highest score; 2) the score is larger than the highest score multiplied by a threshold $\delta$.
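A minimal sketch of the inference step with temporal pooling follows. It assumes each frame's predicted segment is recovered by subtracting its left distance from, and adding its right distance to, the frame index. The function name, the inclusive overlap test, and the example values (including the threshold) are illustrative assumptions rather than settings from the paper.

```python
def temporal_pooling(cls_scores, conf_scores, distances, delta):
    """Fuse per-frame segment predictions into one (start, end) segment.

    cls_scores[i], conf_scores[i]: subnet outputs for frame i
    distances[i]: predicted (left, right) distances for frame i
    delta: score threshold relative to the best frame
    """
    # Per-frame ranking score and the segment that each frame predicts.
    scores = [c * e for c, e in zip(cls_scores, conf_scores)]
    segments = [(i - left, i + right) for i, (left, right) in enumerate(distances)]

    best = max(range(len(scores)), key=scores.__getitem__)
    best_start, best_end = segments[best]

    start, end = best_start, best_end
    for score, (seg_start, seg_end) in zip(scores, segments):
        # Condition 1: the candidate overlaps the top-scoring segment.
        overlaps = seg_start <= best_end and seg_end >= best_start
        # Condition 2: its score exceeds delta times the highest score.
        if overlaps and score > delta * scores[best]:
            start, end = min(start, seg_start), max(end, seg_end)
    return start, end


cls_scores = [0.1, 0.8, 0.9, 0.9, 0.2]
conf_scores = [0.1, 0.5, 1.0, 0.8, 0.1]
distances = [(0, 1), (1, 2), (1, 1), (2, 2), (0, 0)]
pooled = temporal_pooling(cls_scores, conf_scores, distances, delta=0.5)  # (1, 5)
```

In this toy input the top-scoring frame alone predicts (1, 3); the neighbouring frame with a score above the threshold extends the right boundary to 5. Setting `delta` to 1.0 disables the fusion and returns the single best prediction.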

Datasets and Metrics
Datasets. We evaluated DEBUG on three challenging NLVL benchmarks:
TACoS (Regneri et al., 2013). It consists of 127 videos and 17,344 text-to-clip pairs. In our experiments, we used the same standard split as Gao et al. (2017), i.e., 50% for training, 25% for validation and 25% for test. The average length of each video is 5 minutes.
Charades-STA (Gao et al., 2017). It consists of 12,408 text-to-clip pairs for training, and 3,720 pairs for test. The average length of each video is 30 seconds.
ActivityNet Captions (Krishna et al., 2017). It is not only the largest NLVL dataset (19,209 videos) but also has much more diverse context than the others. We followed Yuan et al. (2019) and utilized the public train set (i.e., 37,421 text-to-clip pairs) for training, and the validation set (i.e., 17,505 text-to-clip pairs) for test. The average length of each video is 2 minutes.
Evaluation Metrics. Following the conventions in previous works, we evaluated NLVL with two prevailing evaluation metrics: 1) R@N, IoU@θ: the percentage of testing samples which have at least one of the top-N results with IoU larger than θ. Owing to the nature of the bottom-up framework, we only use N=1 in all our experiments; 2) mIoU: the average IoU over all testing samples.
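The two metrics can be sketched as follows for the N=1 case used here. Segments are (start, end) pairs in any consistent time unit; the function names are hypothetical, and reporting mIoU as a percentage is an assumption made to match how such numbers are usually tabulated.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(preds, gts, theta):
    """Percentage of samples whose top-1 prediction has IoU larger than theta."""
    hits = sum(temporal_iou(p, g) > theta for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds, gts):
    """Average IoU over all samples, as a percentage."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


preds = [(0, 10), (5, 15), (0, 4)]
gts = [(0, 10), (10, 20), (2, 6)]
r1_at_05 = recall_at_1(preds, gts, theta=0.5)
r1_at_03 = recall_at_1(preds, gts, theta=0.3)
miou = mean_iou(preds, gts)
```

In this toy example the three IoU values are 1, 1/3 and 1/3, so only one of the three predictions counts as a hit at θ = 0.5 while all three count at θ = 0.3.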

Implementation Details
Given an untrimmed video V, we first downsampled the frames and utilized the C3D (Tran et al., 2015) features pretrained on Sports-1M (Karpathy et al., 2014) as the initial frame features. Then, we reduced the dimension of these features to 500 using PCA; the results are the video frame features F (Section 3.1). Each query Q was truncated or padded to a maximum length of 15 words. Each word was initialized with its 300-d GloVe vector (Pennington et al., 2014), and all word embeddings were fixed. Then we learned a transformation matrix to map these embeddings to 500-d; the results are the sentence word features W (Section 3.1). The dimension of all intermediate layers in the backbone and the three subnets was set to 128. We trained the whole network for 100 epochs from scratch, and the loss was optimized with the Adam algorithm (Kingma and Ba, 2015). The learning rate started from 0.0001 and was divided by 10 when the loss plateaued. The batch size was set to 16, and the dropout rate was 0.5.
From Table 1, we can observe that DEBUG achieves new state-of-the-art performance under all evaluation metrics and benchmarks. It is worth noting that DEBUG improves the performance especially significantly on the stricter metrics (e.g., 2.62% and 2.52% absolute improvement in mIoU on TACoS and ActivityNet Captions, and 2.12% absolute improvement in IoU@0.7 on Charades-STA), which demonstrates the effectiveness of DEBUG.

Ablative Studies
In this section, we did extensive ablative experiments to thoroughly investigate the DEBUG.

Sparse vs. Dense Bottom-Up
Setting. To eliminate the influence of the backbone and fairly measure the performance gain of DEBUG over the existing sparse bottom-up framework, we designed a strong baseline dubbed QANet-SE. Its backbone is identical to that of DEBUG (Figure 3), but its head network follows the sparse bottom-up framework, i.e., it predicts the start and end time points directly.
Results. The results are reported in Table 2. We can observe that DEBUG surpasses QANet-SE over all metrics and benchmarks. The performance gains are especially pronounced on TACoS (e.g., 20%∼40% relative improvements in all metrics). This is because the average video length in TACoS is the largest among all benchmarks, and a QANet-SE style (i.e., sparse bottom-up) method suffers from a severe imbalance between positive and negative samples in long videos. In contrast, DEBUG relieves this problem by utilizing many more positive training samples.

Importance of Each Component
We ran a number of experiments to analyze the importance of each component in DEBUG. Results are shown in Table 3 and discussed in detail next.
Table 3 (one row per model variant; the row and column labels are not recoverable from the source):
33.74 20.34 10.97 13.83 48.52 34.68 16.40 32.32 70.93 54.94 38.85 38.48
35.22 22.07 11.44 14.56 52.16 35.89 17.92 34.04 73.34 55.82 39.20 39.01
37.59 22.76 11.40 15.23 51.72 34.60 16.94 34.15 73.56 55.43 39.31 38.74
40.14 22.27 11.58 15.43 51.67 35.38 15.51 33.86 73.62 55.52 39.00 39.33
41.15 23.45 11.72 16.03 54.95 37.39 17.69 36.34 74.26 55.91 39.72 39.51
Classification vs. Confidence Subnet. From Table 3, we can observe that models with only a single classification or confidence subnet achieve comparable performance. More precisely, the latter is slightly better than the former. This is because the confidence subnet considers the importance of each frame, whereas the classification subnet treats all foreground frames equally. Meanwhile, the performance of both models can be further boosted by utilizing the two subnets simultaneously, which demonstrates that this multi-task design helps each subnet to focus on its own goal (i.e., one on foreground prediction, the other on "centerness" prediction) and that both subnets benefit from sharing features.
With vs. Without Temporal Pooling. From Table 3, we can observe that the temporal pooling trick improves the performance in most of the situations, and the performance gains in TACoS are largest over all benchmarks. The main reason is that the visual appearance of each frame in TACoS is quite similar, i.e., the performance based on single frame prediction is very unstable since multiple frames have similar predictions. Instead, the model with temporal pooling helps to avoid this by fusing multiple frame predictions.

Efficiency Analysis.
We evaluated the efficiency of DEBUG by comparing the average run time needed to localize one sentence in a video. As shown in Table 4, DEBUG is faster than the top-down models (e.g., VSA-RNN, VSA-STV, CTRL, ACRN), and the gap is much wider on long video datasets (e.g., TACoS). This is consistent with the notorious drawback of the top-down framework, i.e., it is computation-intensive because of the dense sliding windows or anchors. Meanwhile, DEBUG is even slightly faster than the sparse bottom-up model (ABLR), because the QANet backbone uses only conv and self-attention layers instead of the time-consuming RNN in the ABLR backbone. All experiments were conducted on the same hardware (an NVIDIA GTX 1080Ti).

Error Analysis.
Figure 7: Qualitative results of DEBUG on ActivityNet Captions. Example language queries: "A woman is seen speaking to the camera and then pours ice into a glass."; "A boy is balanced on a board, then shows several moves and stunts as he moves around a skate park."; "Two men are seen standing behind drums and playing with one another."; "The stop playing in the end to speak to one another and walk off stage."
To analyze the bottlenecks of DEBUG and the existing sparse bottom-up framework, we compared DEBUG and QANet-SE in terms of segment length and keypoint accuracy.
Segment Length. We compared the length error of the segments predicted by DEBUG and QANet-SE; the results are illustrated in Figure 6. We observe that QANet-SE is prone to predicting overlong segments (e.g., the samples with a length error larger than 300 frames in Charades-STA or 5,000 frames in TACoS make up a large proportion).
Keypoint Accuracy. We compared the accuracy of three keypoints of the ground truth segment: the left, right, and middle points. We regard a keypoint prediction as correct if the absolute frame difference between the prediction and the ground truth is smaller than a threshold. We used three thresholds (100, 200, 300), and the results are reported in Table 5. We have two observations: 1) For the middle point, DEBUG always achieves higher accuracy. 2) For the boundary points, DEBUG falls behind QANet-SE only on TACoS at threshold 100.
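The keypoint accuracy measurement can be sketched as below. Segments are (start, end) pairs in frames, and the middle point is taken to be the average of the start and end frames, which is an assumption of this sketch; the function name and example values are hypothetical.

```python
def keypoint_accuracy(preds, gts, threshold):
    """Accuracy (%) of the left, right, and middle points of predicted segments.

    A keypoint counts as correct when its absolute frame difference from the
    ground truth keypoint is smaller than the threshold.
    """
    hits = {"left": 0, "right": 0, "middle": 0}
    for (pred_start, pred_end), (gt_start, gt_end) in zip(preds, gts):
        errors = {
            "left": abs(pred_start - gt_start),
            "right": abs(pred_end - gt_end),
            "middle": abs((pred_start + pred_end) / 2 - (gt_start + gt_end) / 2),
        }
        for name, error in errors.items():
            if error < threshold:
                hits[name] += 1
    return {name: 100.0 * count / len(gts) for name, count in hits.items()}


preds = [(100, 500), (0, 1000)]
gts = [(150, 450), (300, 700)]
accuracy = keypoint_accuracy(preds, gts, threshold=100)
```

The second toy prediction is an overlong segment: both of its boundaries miss by 300 frames, yet its middle point is exact, which is the kind of case that separates middle-point accuracy from boundary accuracy.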
Analysis. The sparse bottom-up approach (e.g., QANet-SE) can produce good boundary predictions even on long video datasets (e.g., TACoS), which is consistent with its design of training a boundary classifier. But the bottleneck of this approach is that the predictions of the start and end points are independent, which tends to result in overlong segment predictions. In contrast, DEBUG predicts both boundaries from the same frame feature, which avoids predicting overlong segments. Meanwhile, DEBUG has a confidence subnet to predict the "centerness" of each frame, which helps to predict the middle point. The bottleneck of DEBUG is the accuracy of the boundary points on long video datasets.

Qualitative Results.
The qualitative results of DEBUG on ActivityNet Captions are illustrated in Figure 7. We can observe that DEBUG is sensitive to the language query, i.e., the predicted scores ŝ are totally different for different language queries, even for the same video. Meanwhile, the score ŝ is always a unimodal curve with its peak near the midpoint of the ground truth segment, which is consistent with the design of DEBUG, which uses "centerness" as the confidence target.

Conclusion
We proposed a novel dense bottom-up framework, DEBUG, for NLVL. It is the first bottom-up model that surpasses all top-down models while running at the highest speed. Compared to the existing bottom-up models, DEBUG improves performance significantly by: 1) making full use of positive samples to alleviate the severe imbalance problem; 2) disentangling boundary detection from query-related prediction to relieve the burden on a single network; 3) predicting both boundaries from the same frame to avoid the local optimum caused by independent predictions. Moving forward, we are going to narrow the gap between top-down and bottom-up models, and design a hybrid framework exploring both choices.