Span-based Localizing Network for Natural Language Video Localization

Given an untrimmed video and a text query, natural language video localization (NLVL) aims to locate a matching span from the video that semantically corresponds to the query. Existing solutions formulate NLVL either as a ranking task, applying a multimodal matching architecture, or as a regression task, directly regressing the target video span. In this work, we address the NLVL task with a span-based QA approach by treating the input video as a text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework, to address NLVL. The proposed VSLNet tackles the differences between NLVL and span-based QA through a simple yet effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the matching video span within a highlighted region. Through extensive experiments on three benchmark datasets, we show that the proposed VSLNet outperforms state-of-the-art methods, and that adopting a span-based QA framework is a promising direction for solving NLVL.


Introduction
Given an untrimmed video, natural language video localization (NLVL) is to retrieve or localize a temporal moment that semantically corresponds to a given language query. An example is shown in Figure 1. As an important vision-language understanding task, NLVL involves both computer vision and natural language processing techniques (Krishna et al., 2017; Hendricks et al., 2017; Gao et al., 2018). Clearly, cross-modal reasoning is essential for NLVL to correctly locate the target moment in a video. Prior works primarily treat NLVL as a ranking task, solved by applying a multimodal matching architecture to find the best matching video segment for a given language query (Gao et al., 2017; Hendricks et al., 2018; Liu et al., 2018a; Ge et al., 2019; Chen and Jiang, 2019; Zhang et al., 2019). Recently, some works explore modeling the cross-interactions between video and query, and directly regressing the temporal locations of the target moment (Yuan et al., 2019b; Lu et al., 2019a). There are also studies that formulate NLVL as a sequential decision making problem and solve it with reinforcement learning. We address the NLVL task from a different perspective. The essence of NLVL is to search an untrimmed video for a video moment that answers a given language query. By treating the video as a text passage and the target moment as the answer span, NLVL shares significant similarities with the span-based question answering (QA) task, and the span-based QA framework (Seo et al., 2017; Huang et al., 2018) can be adopted for NLVL. Hence, we attempt to solve this task with a multimodal span-based QA approach.
There are two main differences between the traditional text span-based QA and NLVL tasks. First, video is continuous, and causally related video events are usually adjacent. Natural language, on the other hand, is inconsecutive, and words in a sentence demonstrate syntactic structure. For instance, changes between adjacent video frames are usually very small, while adjacent word tokens may carry distinctive meanings. As a result, many events in a video are directly correlated and can even cause one another (Krishna et al., 2017), whereas causalities between word spans or sentences are usually indirect and can be far apart. Second, compared to word spans in text, humans are insensitive to small shifts between video frames. In other words, small offsets between video frames do not affect the understanding of video content, but the difference of a few words or even one word could change the meaning of a sentence.
As a baseline, we first solve the NLVL task with a standard span-based QA framework named VSLBase. Specifically, visual features are analogous to those of a text passage, and the target moment is regarded as the answer span. VSLBase is trained to predict the start and end boundaries of the answer span. Note that VSLBase does not address the two aforementioned major differences between video and natural language. To address them, we propose an improved version named VSLNet (Video Span Localizing Network). VSLNet introduces a Query-Guided Highlighting (QGH) strategy in addition to VSLBase. Here, we regard the target moment and its adjacent contexts as foreground and the rest as background, i.e., the foreground covers a slightly longer span than the answer span. With QGH, VSLNet is guided to search for the target moment within a highlighted region. Through region highlighting, VSLNet addresses the two differences well. First, the longer region provides additional contexts for locating the answer span, due to the continuous nature of video content. Second, the highlighted region helps the network focus on subtle differences between video frames, because the search space is reduced compared to the full video.
Experimental results on three benchmark datasets show that adopting span-based QA framework is suitable for NLVL. With a simple network architecture, VSLBase delivers comparable performance to strong baselines. In addition, VSLNet further boosts the performance and achieves the best among all evaluated methods.

Related Work
Natural Language Video Localization. The task of retrieving video segments using language queries was introduced in (Hendricks et al., 2017; Gao et al., 2017). Solutions to NLVL need to model the cross-interactions between natural language and video. Early works treat NLVL as a ranking task and rely on a multimodal matching architecture to find the best matching video moment for a language query (Gao et al., 2017; Hendricks et al., 2017, 2018; Wu and Han, 2018; Liu et al., 2018a,b; Zhang et al., 2019). Although intuitive, these models are sensitive to negative samples. Specifically, they need to densely sample candidate moments to achieve good performance, which leads to low efficiency and a lack of flexibility.
Various approaches have been proposed to overcome these drawbacks. Yuan et al. (2019b) builds a proposal-free method using BiLSTM and directly regresses the temporal locations of the target moment. Lu et al. (2019a) proposes a dense bottom-up framework, which regresses the distances to the start and end boundaries for each frame in the target moment, and selects the ones with the highest confidence as the final result. Yuan et al. (2019a) proposes a semantic conditioned dynamic modulation for better correlating sentence-related video contents over time and establishing a precise matching relationship between sentence and video. There are also works that formulate NLVL as a sequential decision making problem and adopt reinforcement learning based approaches to progressively observe candidate moments conditioned on the language query.
Most similar to our work are Ghosh et al. (2019) and a related study, as both use the concept of question answering to address NLVL. However, neither explains the similarities and differences between NLVL and traditional span-based QA, nor adopts the standard span-based QA framework. In our study, VSLBase adopts the standard span-based QA framework, and VSLNet explicitly addresses the differences between NLVL and traditional span-based QA tasks.
Span-based Question Answering. Span-based QA has been widely studied in past years. Wang and Jiang (2017) combines match-LSTM (Wang and Jiang, 2016) and Pointer-Net (Vinyals et al., 2015) to estimate the boundaries of the answer span. BiDAF (Seo et al., 2017) introduces bi-directional attention to obtain query-aware context representations. Xiong et al. (2017) proposes a coattention network to capture the interactions between context and query. R-Net integrates mutual and self attentions into an RNN encoder for feature refinement. QANet (Yu et al., 2018) leverages a similar attention mechanism in a stacked convolutional encoder to improve performance. FusionNet (Huang et al., 2018) presents a fully-aware multi-level attention to capture complete query information. By treating the input video as a text passage, the above frameworks are all applicable to NLVL in principle. However, these frameworks are not designed to consider the differences between video and text passage; their modeling complexity arises from the interactions between query and passage, both of which are text. In our solution, VSLBase adopts a simple and standard span-based QA framework, making it easier to model the differences between video and text by adding additional modules. Our VSLNet addresses the differences by introducing the QGH module. Very recently, pre-trained transformer-based language models (Devlin et al., 2019) have elevated the performance of span-based QA tasks by a large margin. Meanwhile, similar pre-trained models (Sun et al., 2019a,b; Rahman et al., 2019; Nguyen and Okatani, 2019; Lu et al., 2019b; Tan and Bansal, 2019) are being proposed to learn joint distributions over multimodal sequences of visual and linguistic inputs. Exploring pre-trained models for NLVL is part of our future work and out of the scope of this study.

Methodology
We now describe how to address the NLVL task by adopting a span-based QA framework. We then present VSLBase (Sections 3.2 to 3.4) and VSLNet in detail. Their architectures are shown in Figure 2.

Span-based QA for NLVL
We denote the untrimmed video as V = {f_t}_{t=1}^T and the language query as Q = {q_j}_{j=1}^m, where T and m are the number of frames and words, respectively. τ_s and τ_e represent the start and end time of the temporal moment, i.e., the answer span. To address NLVL with a span-based QA framework, its data is transformed into a set of SQuAD-style triples (Context, Question, Answer) (Rajpurkar et al., 2016). For each video V, we extract its visual features V = {v_i}_{i=1}^n by a pre-trained 3D ConvNet (Carreira and Zisserman, 2017), where n is the number of extracted features. Here, V can be regarded as the sequence of word embeddings for a text passage with n tokens. Similar to a word embedding, each feature v_i here is a video feature vector.
Since span-based QA aims to predict the start and end boundaries of an answer span, the start/end time of a video sequence needs to be mapped to the corresponding boundaries in the visual feature sequence V. Suppose the video duration is T; the start (end) span index is calculated by a_{s(e)} = round(τ_{s(e)} / T × n), where round(·) denotes the rounding operator. During inference, the predicted span boundary can be easily converted back to the corresponding time via τ_{s(e)} = a_{s(e)} / n × T.
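The mapping between timestamps and feature indices can be sketched as follows (the function names are illustrative; the exact rounding behavior at span boundaries is an assumption):

```python
def time_to_index(tau, duration, n):
    """Map a timestamp tau (in seconds) of a video with the given
    duration to the index of the corresponding one of n visual features."""
    return round(tau / duration * n)

def index_to_time(a, duration, n):
    """Inverse mapping used at inference to convert a predicted
    span boundary back to a timestamp."""
    return a / n * duration
```

For example, in a 30-second video represented by 120 visual features, a moment boundary at 7.5 seconds maps to feature index 30, and index 30 maps back to 7.5 seconds.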
After transforming the moment annotations in an NLVL dataset, we obtain a set of (V, Q, A) triples. Visual features V = [v_1, v_2, ..., v_n] act as the passage with n tokens; Q = [q_1, q_2, ..., q_m] is the query with m tokens; and the answer A = [v_{a_s}, v_{a_s+1}, ..., v_{a_e}] corresponds to a piece of the passage. The NLVL task then becomes finding the correct start and end boundaries of the answer span, a_s and a_e.

Feature Encoder
We already have visual features V = {v_i}_{i=1}^n ∈ R^{n×d_v}. Word embeddings of a text query Q, Q = {q_j}_{j=1}^m ∈ R^{m×d_q}, are easily obtainable, e.g., with GloVe. We project them into the same dimension d, V ∈ R^{n×d} and Q ∈ R^{m×d}, by two linear layers (see Figure 2(a)). Then we build the feature encoder with a simplified version of the embedding encoder layer in QANet (Yu et al., 2018).
Instead of applying a stack of multiple encoder blocks, we use only one encoder block. This encoder block consists of four convolution layers, followed by a multi-head attention layer (Vaswani et al., 2017). A feed-forward layer is used to produce the output. Layer normalization (Ba et al., 2016) and residual connection (He et al., 2016) are applied to each layer. The encoded visual features and word embeddings are Ṽ = FeatureEncoder(V) and Q̃ = FeatureEncoder(Q). The parameters of the feature encoder are shared by visual features and word embeddings.
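A PyTorch sketch of such an encoder block is given below. Hyperparameter defaults follow the implementation details reported later (dimension 128, kernel size 7, 8 attention heads); the exact placement of normalization and the choice of ReLU activation are assumptions, not taken from the paper:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block: convolution layers, then multi-head attention,
    then a feed-forward output layer, each with layer normalization and
    a residual connection (a simplified QANet embedding encoder layer)."""
    def __init__(self, dim=128, kernel_size=7, num_convs=4, num_heads=8):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
            for _ in range(num_convs))
        self.conv_norms = nn.ModuleList(
            nn.LayerNorm(dim) for _ in range(num_convs))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_norm = nn.LayerNorm(dim)
        self.ffn = nn.Linear(dim, dim)
        self.ffn_norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, seq_len, dim)
        for norm, conv in zip(self.conv_norms, self.convs):
            y = conv(norm(x).transpose(1, 2)).transpose(1, 2)
            x = x + torch.relu(y)              # residual connection
        h = self.attn_norm(x)
        y, _ = self.attn(h, h, h)              # self-attention
        x = x + y
        return x + self.ffn(self.ffn_norm(x))  # feed-forward output layer
```

Because the block preserves the feature dimension, the same instance can encode both the projected visual features and the projected word embeddings, which is how parameter sharing between the two modalities is realized.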

Context-Query Attention
After feature encoding, we use context-query attention (CQA) (Seo et al., 2017; Xiong et al., 2017; Yu et al., 2018) to capture the cross-modal interactions between visual and textual features. CQA first calculates the similarity scores, S ∈ R^{n×m}, between each visual feature and each query feature. Then the context-to-query (A) and query-to-context (B) attentions are computed as A = S_r · Q̃ and B = S_r · S_c^T · Ṽ, where S_r and S_c are the row- and column-wise normalization of S by SoftMax, respectively. Finally, the output of context-query attention is written as V^q = FFN([Ṽ; A; Ṽ ⊙ A; Ṽ ⊙ B]), where V^q ∈ R^{n×d}; FFN is a single feed-forward layer; ⊙ denotes element-wise multiplication.
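A minimal numpy sketch of the CQA computation is shown below. It uses a plain dot product as the similarity function (the actual model learns a trainable similarity) and returns the concatenated representation before the final feed-forward projection:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_query_attention(V, Q):
    """V: (n, d) encoded visual features; Q: (m, d) encoded query
    features. Returns the (n, 4d) query-aware representation that
    feeds the single feed-forward layer."""
    S = V @ Q.T                  # similarity scores, (n, m)
    S_r = softmax(S, axis=1)     # row-wise normalization
    S_c = softmax(S, axis=0)     # column-wise normalization
    A = S_r @ Q                  # context-to-query attention, (n, d)
    B = S_r @ S_c.T @ V          # query-to-context attention, (n, d)
    return np.concatenate([V, A, V * A, V * B], axis=1)
```

This mirrors the BiDAF/QANet formulation, with visual features playing the role of the text passage.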

Conditioned Span Predictor
We construct a conditioned span predictor by using two unidirectional LSTMs and two feed-forward layers, inspired by Ghosh et al. (2019). The main difference between ours and Ghosh et al. (2019) is that we use unidirectional LSTMs instead of bidirectional ones. We observe that the unidirectional LSTM shows similar performance with fewer parameters and higher efficiency. The two LSTMs are stacked so that the LSTM of the end boundary can be conditioned on that of the start boundary. Then the hidden states of the two LSTMs are fed into the corresponding feed-forward layers to compute the start and end scores, which are normalized by SoftMax into the probability distributions P_s and P_e of the start and end boundaries.

The span predictor is trained by minimizing L_span = f_CE(P_s, Y_s) + f_CE(P_e, Y_e), where f_CE represents the cross-entropy loss function; Y_s and Y_e are the labels for the start (a_s) and end (a_e) boundaries, respectively. During inference, the predicted answer span (â_s, â_e) of a query is generated by maximizing the joint probability of the start and end boundaries: span(â_s, â_e) = argmax_{â_s, â_e} P_s(â_s) × P_e(â_e), s.t. 0 ≤ â_s ≤ â_e ≤ n. This completes the VSLBase architecture (see Figure 2(a)). VSLNet is built on top of VSLBase with QGH, to be detailed next.
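The constrained maximization of the joint probability at inference can be implemented directly; a brute-force sketch (adequate for typical feature sequence lengths of a few hundred):

```python
import numpy as np

def locate_span(p_start, p_end):
    """Return (a_s, a_e) maximizing p_start[a_s] * p_end[a_e]
    subject to a_s <= a_e."""
    joint = np.triu(np.outer(p_start, p_end))  # zero out pairs with a_e < a_s
    a_s, a_e = np.unravel_index(joint.argmax(), joint.shape)
    return int(a_s), int(a_e)
```

For instance, with start probabilities [0.1, 0.6, 0.3] and end probabilities [0.5, 0.1, 0.4], the unconstrained end maximum would be index 0, which lies before the best start; the constrained joint maximum is the span (1, 2).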

Query-Guided Highlighting
A Query-Guided Highlighting (QGH) strategy is introduced in VSLNet to address the major differences between text span-based QA and NLVL tasks, as shown in Figure 2(b). With the QGH strategy, we consider the target moment as the foreground and the rest as background, as illustrated in Figure 3. The target moment, which is aligned with the language query, starts from a_s and ends at a_e with length L = a_e − a_s. QGH extends the boundaries of the foreground to cover its antecedent and consequent video contents, where the extension ratio is controlled by a hyperparameter α. As mentioned in the Introduction, the extended boundary could potentially cover additional contexts and also help the network to focus on subtle differences between video frames.
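The extension and the resulting foreground labels can be sketched as follows (the exact rounding and the clipping at video boundaries are assumptions):

```python
def foreground_labels(a_s, a_e, n, alpha):
    """Build the 0/1 label sequence marking the extended foreground:
    the answer span [a_s, a_e] is extended on both sides by
    alpha * (a_e - a_s), clipped to the n available visual features."""
    ext = round(alpha * (a_e - a_s))
    start = max(0, a_s - ext)
    end = min(n - 1, a_e + ext)
    return [1 if start <= i <= end else 0 for i in range(n)]
```

With alpha = 0, the foreground is exactly the answer span; as alpha grows, the highlighted region approaches the whole video.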
By assigning 1 to the foreground and 0 to the background, we obtain a binary sequence, denoted by Y^h. QGH is a binary classification module that predicts the confidence that a visual feature belongs to the foreground or the background. The structure of QGH is shown in Figure 4. We first encode word features Q into a sentence representation (denoted by h_Q) with a self-attention mechanism (Bahdanau et al., 2015).
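A numpy sketch of the QGH computation just described is given below. The additive self-attention form and the weight shapes are illustrative assumptions; in the actual model these parameters are learned end-to-end, and the scorer may be parameterized differently:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def query_guided_highlight(V_q, Q, w_attn, w_score):
    """V_q: (n, d) query-aware visual features; Q: (m, d) word features;
    w_attn: (d,) self-attention scoring vector; w_score: (2d,) scoring
    weights (both hypothetical). Returns the highlighting scores S_h
    and the highlighted features."""
    # sentence representation h_Q via self-attention over word features
    a = np.exp(Q @ w_attn)
    a /= a.sum()
    h_Q = a @ Q                                        # (d,)
    # concatenate h_Q with each visual feature, then score with Sigmoid
    fused = np.concatenate(
        [np.tile(h_Q, (len(V_q), 1)), V_q], axis=1)    # (n, 2d)
    S_h = sigmoid(fused @ w_score)                     # (n,) scores
    return S_h, S_h[:, None] * fused                   # highlighted features
```

The highlighted features then replace the plain query-aware features when computing the span loss, so that boundary prediction is biased toward the highlighted region.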
The highlighting score is computed as S_h = σ(Conv1D([h_Q; V^q])), where σ denotes the Sigmoid activation and S_h ∈ R^n. The highlighted features are calculated by Ṽ^q = S_h · [h_Q; V^q]. Accordingly, feature V^q in Equation 3 is replaced by Ṽ^q in VSLNet to compute L_span. The loss function of query-guided highlighting is formulated as L_QGH = f_CE(S_h, Y^h). VSLNet is trained in an end-to-end manner by minimizing the following loss: L = L_span + L_QGH.

Experiments

Experimental Settings
Metrics. We adopt "R@n, IoU = µ" and "mIoU" as the evaluation metrics, following (Gao et al., 2017;Liu et al., 2018a;Yuan et al., 2019b). The "R@n, IoU = µ" denotes the percentage of language queries having at least one result whose Intersection over Union (IoU) with ground truth is larger than µ in top-n retrieved moments. "mIoU" is the average IoU over all testing samples. In our experiments, we use n = 1 and µ ∈ {0.3, 0.5, 0.7}.
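These metrics can be computed as follows, a straightforward sketch for the n = 1 case with one top prediction per query:

```python
def temporal_iou(pred, gt):
    """IoU between two temporal spans, each a (start, end) pair in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_and_miou(preds, gts, mu):
    """Return ("R@1, IoU = mu", "mIoU"), both in percent, over paired
    top-1 predictions and ground-truth moments."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    r_at_1 = 100.0 * sum(iou > mu for iou in ious) / len(ious)
    m_iou = 100.0 * sum(ious) / len(ious)
    return r_at_1, m_iou
```

For example, a prediction (0, 10) against a ground truth (5, 15) overlaps by 5 seconds out of a 15-second union, giving an IoU of 1/3, which counts as a hit at µ = 0.3 but a miss at µ = 0.5.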
Implementation. For the language query Q, we use 300d GloVe (Pennington et al., 2014) vectors to initialize each lowercase word; the word embeddings are fixed during training. For the untrimmed video V, we downsample frames and extract RGB visual features using the 3D ConvNet pre-trained on the Kinetics dataset (Carreira and Zisserman, 2017). We set the dimension of all hidden layers in the model to 128; the kernel size of the convolution layer is 7; the head size of multi-head attention is 8. For all datasets, the model is trained for 100 epochs with a batch size of 16 and an early stopping strategy. Parameter optimization is performed by Adam (Kingma and Ba, 2015) with a learning rate of 0.0001, linear decay of the learning rate, and gradient clipping of 1.0. Dropout (Srivastava et al., 2014) of 0.2 is applied to prevent overfitting.

Comparison with State-of-the-Arts
We compare VSLBase and VSLNet with state-of-the-art methods, including CTRL (Gao et al., 2017) and DEBUG (Lu et al., 2019a). In all result tables, the scores of compared methods are reported in the corresponding works; best results are in bold and second best underlined. The results on Charades-STA are summarized in Table 2. VSLNet surpasses all baselines, e.g., by 0.78% in IoU = 0.5 compared to MAN. Without query-guided highlighting, VSLBase outperforms all compared baselines over IoU = 0.7, which shows that adopting a span-based QA framework is promising for NLVL. Moreover, VSLNet benefits from visual feature fine-tuning and achieves state-of-the-art results on this dataset. Table 3 reports the results of "R@n, IoU = µ" and "mIoU" compared with the state-of-the-art on ActivityNet Caption, where VSLNet also achieves comparable performance with baselines.

Ablation Studies
We conduct ablative experiments to analyze the importance of the feature encoder and context-query attention in our approach. We also investigate the impact of the extension ratio α (see Figure 3) in query-guided highlighting (QGH). Finally, we visually show the effectiveness of QGH in VSLNet and discuss the weaknesses of VSLBase and VSLNet.

Module Analysis
We study the effectiveness of our feature encoder and context-query attention (CQA) by replacing them with other modules. Specifically, we use a bidirectional LSTM (BiLSTM) as an alternative feature encoder. For context-query attention, we replace it with a simple method (named CAT) that concatenates each visual feature with the max-pooled query feature. Recall that our feature encoder consists of Convolution + Multi-head attention + Feed-forward layers (see Section 3.2); we name it CMF. With the alternatives, we have 4 combinations, listed in Table 5. From the results, CMF shows stable superiority over BiLSTM on all metrics regardless of the other modules; CQA surpasses CAT whichever feature encoder is used. This study indicates that CMF and CQA are more effective. Table 6 reports the performance gains of different modules on the "R@1, IoU = 0.7" metric. The results show that replacing CAT with CQA leads to larger improvements than replacing BiLSTM with CMF. This observation suggests that CQA plays a more important role in our model. Specifically, keeping CQA, the absolute gain is 1.61% by replacing the encoder module; keeping CMF, the gain of replacing the attention module is 3.09%. Figure 5 visualizes the matrix of similarity scores between visual and language features in the context-query attention (CQA) module (S ∈ R^{n×m} in Section 3.3). The figure shows that visual features are more relevant to the verbs and their objects in the query sentence. For example, the similarity scores between visual features and "eating" (or "sandwich") are higher than those of other words. We believe that verbs and their objects are more likely to be used to describe video activities. Our observation is consistent with Ge et al. (2019), where verb-object pairs are extracted as semantic activity concepts; in our method, these concepts are automatically captured by the CQA module.

The Impact of Extension Ratio in QGH
We now study the impact of extension ratio α in query-guided highlighting module on Charades-STA dataset. We evaluated 12 different values of α from 0.0 to ∞ in experiments. 0.0 represents no answer span extension, and ∞ means that the entire video is regarded as foreground.
The results for various α's are plotted in Figure 6. It shows that query-guided highlighting consistently contributes to performance improvements, regardless of α values, i.e., from 0 to ∞.
As α increases, the performance of VSLNet first improves and then gradually degrades. The optimal performance appears between α = 0.05 and 0.2 over all metrics.
Note that when α = ∞, which is equivalent to highlighting no specific region (the entire video is treated as the coarse region for locating the target moment), VSLNet remains better than VSLBase. As shown in Figure 4, when α = ∞, QGH effectively becomes a straightforward concatenation of the sentence representation with each visual feature. The resultant feature remains helpful for capturing semantic correlations between vision and language. In this sense, this function can be regarded as an approximation or simulation of the traditional multimodal matching strategy (Hendricks et al., 2017; Gao et al., 2017; Liu et al., 2018a). Figure 7 shows the histograms of predicted results on the test sets of the Charades-STA and ActivityNet Caption datasets. VSLNet beats VSLBase by having more samples in the high IoU ranges, e.g., IoU ≥ 0.7, on the Charades-STA dataset; more predicted results of VSLNet are also distributed in the high IoU ranges on the ActivityNet Caption dataset. This demonstrates the effectiveness of the query-guided highlighting (QGH) strategy.

Qualitative Analysis
We show two examples in Figures 8(a) and 8(b) from the Charades-STA and ActivityNet Caption datasets, respectively. In both figures, the moments localized by VSLNet are closer to the ground truth than those by VSLBase. Meanwhile, the start and end boundaries predicted by VSLNet are roughly constrained within the highlighted regions S_h computed by QGH.
We further study the error patterns of predicted moment lengths, as shown in Figure 9. The differences between the moment lengths of ground truths and predicted results are measured. A positive length difference means the predicted moment is longer than the corresponding ground truth, while a negative one means shorter. Figure 9 shows that VSLBase tends to predict longer moments, e.g., more samples with length error larger than 4 seconds on Charades-STA or 30 seconds on ActivityNet Caption. On the contrary, constrained by QGH, VSLNet tends to predict shorter moments, e.g., more samples with length error smaller than −4 seconds on Charades-STA or −20 seconds on ActivityNet Caption. This observation is helpful for future research on adopting the span-based QA framework for NLVL.
In addition, we also examine failure cases (with IoU predicted by VSLNet lower than 0.2), shown in Figure 10. In the first case, illustrated in Figure 10(a), we observe an action in which a person turns towards the lamp and places an item there. QGH falsely predicts this action as the beginning of the moment "turns off the light". The second failure case involves multiple actions in a query, as shown in Figure 10(b). QGH successfully highlights the correct region by capturing the temporal information of the two different action descriptions in the given query. However, it assigns "pushes" a higher confidence score than "grabs". Thus, VSLNet only captures the region corresponding to the "pushes" action.

Conclusion
By considering a video as a text passage, we solve the NLVL task with a multimodal span-based QA framework. Through experiments, we show that adopting a standard span-based QA framework, VSLBase, effectively addresses the NLVL problem. However, there are two major differences between video and text. We further propose VSLNet, which introduces a simple and effective strategy named query-guided highlighting on top of VSLBase. With QGH, VSLNet is guided to search for answers within a predicted coarse region. The effectiveness of VSLNet (and even VSLBase) suggests that it is promising to explore the span-based QA framework to address NLVL problems.