WSLLN: Weakly Supervised Natural Language Localization Networks

We propose weakly supervised language localization networks (WSLLN) to detect events in long, untrimmed videos given language queries. To learn the correspondence between visual segments and texts, most previous methods require temporal coordinates (start and end times) of events for training, which leads to high annotation costs. WSLLN relieves the annotation burden by training with only video-sentence pairs, without access to the temporal locations of events. With a simple end-to-end structure, WSLLN measures segment-text consistency and conducts segment selection (conditioned on the text) simultaneously. Results from both are merged and optimized as a video-sentence matching problem. Experiments on ActivityNet Captions and DiDeMo demonstrate that WSLLN achieves state-of-the-art performance.


Introduction
Extensive work has been done on temporal action/activity localization (Shou et al., 2016; Zhao et al., 2017; Dai et al., 2017; Buch et al., 2017; Gao et al., 2017c; Chao et al., 2018), where an action of interest is segmented from long, untrimmed videos. These methods only identify actions from a pre-defined set of categories, which limits their application to situations where only unconstrained language descriptions are available. This more general problem is referred to as natural language localization (NLL) (Hendricks et al., 2017; Gao et al., 2017a). The goal is to retrieve a temporal segment from an untrimmed video based on an arbitrary text query. Recent work focuses on learning the mapping from visual segments to the input text (Hendricks et al., 2017; Gao et al., 2017a; Liu et al., 2018; Hendricks et al., 2018; Zhang et al., 2018) and retrieving segments based on the alignment scores. However, in order to successfully train an NLL model, a large number of diverse language descriptions are needed to describe different temporal segments of videos, which incurs a high human labeling cost.
We propose Weakly Supervised Language Localization Networks (WSLLN), which require only video-sentence pairs during training, with no information about where the activities temporally occur. Intuitively, it is much easier to annotate video-level descriptions than segment-level descriptions. Moreover, when combined with text-based video retrieval techniques, video-sentence pairs may be obtained with minimal human intervention. The proposed model is simple and clean, and can be trained end-to-end in a single stage. We validate our model on ActivityNet Captions and DiDeMo. The results show that our model achieves state-of-the-art performance among weakly supervised approaches and performs comparably to some supervised approaches.

Related Work
Temporal Action Localization in long videos is widely studied in both offline and online scenarios. In the offline setting, temporal action detectors (Shou et al., 2016; Buch et al., 2017; Gao et al., 2017c; Chao et al., 2018) predict the start and end times of actions after observing the whole video, while online approaches (De Geest et al., 2016; Gao et al., 2017b; Shou et al., 2018b; Xu et al., 2018; Gao et al., 2019) label the action class in a per-frame manner without accessing future information. The goal of temporal action detectors is to localize actions in pre-defined categories. However, activities in the wild are very complicated, and it is challenging to cover all the activities of interest with a finite set of categories.
Natural Language Localization in untrimmed videos was first introduced in (Gao et al., 2017a; Hendricks et al., 2017), where, given an arbitrary text query, the methods attempt to localize the text (predict its start and end times) in a video. Hendricks et al. proposed MCN (Hendricks et al., 2017), which embeds the features of visual proposals and sentence representations in the same space and ranks proposals according to their similarity with the sentence. Gao et al. proposed CTRL (Gao et al., 2017a), where alignment and regression are conducted for clip candidates. Liu et al. introduced TMN (Liu et al., 2018). WSDEC (Duan et al., 2018), originally proposed for dense captioning, alternates between language localization and caption generation iteratively. WSDEC generates language localization as intermediate results and can be trained using video-level labels. Thus, we set it as a baseline, although it is not designed for NLL.
Weakly Supervised Localization has been studied extensively to use weak supervision for object detection in images and action localization in videos (Oquab et al., 2015; Bilen and Vedaldi, 2016; Tang et al., 2017; Gao et al., 2018; Kantorov et al., 2016; Li et al., 2016; Jie et al., 2017; Diba et al., 2017; Papadopoulos et al., 2017; Duchenne et al., 2009; Laptev et al., 2008; Bojanowski et al., 2014; Huang et al., 2016; Wang et al., 2017; Shou et al., 2018a). Some methods use class labels to train object detectors. Oquab et al. discussed that object locations may be freely obtained when training classification models (Oquab et al., 2015). Bilen et al. proposed WSDDN (Bilen and Vedaldi, 2016), which focuses on both object recognition and localization. Their proposed two-stream architecture inspired several weakly supervised approaches (Tang et al., 2017; Gao et al., 2018; Wang et al., 2017), including our method. Li et al. presented an adaptation strategy in (Li et al., 2016) which uses the output of a weak detector as pseudo groundtruth to train a detector in a fully supervised way. OICR (Tang et al., 2017) integrates multiple instance learning and iterative classifier refinement in a single network. Some works use other types of weak supervision to optimize detectors. In (Papadopoulos et al., 2017), Papadopoulos et al. used clicks to train detectors. Gao et al. utilized object counts for weakly supervised object detection (Gao et al., 2018). Instead of using temporally labeled segments, weakly supervised action detectors use weaker annotations, e.g., movie scripts (Duchenne et al., 2009; Laptev et al., 2008), the order of the occurring action classes in videos (Bojanowski et al., 2014; Huang et al., 2016) and video-level class labels (Wang et al., 2017; Shou et al., 2018a).

Problem Statement
Following the setting of its strongly supervised counterpart (Gao et al., 2017a; Hendricks et al., 2017), the goal of a weakly supervised language localization (WSLL) method is to localize the event that is described by a sentence query in a long, untrimmed video. Formally, given a video consisting of a sequence of image frames, $V_i = [I_i^1, ..., I_i^T]$, and a text query $Q_i$, the model aims to localize a temporal segment, $[I_i^{st}, ..., I_i^{ed}]$, which semantically aligns best with the query, where $st$ and $ed$ indicate the start and end times, respectively. The difference is that WSLL methods only utilize video-sentence pairs, $\{V_i, Q_i\}_{i=1}^{N}$, for training, while supervised approaches have access to the start and end times of the queries.

The Proposed Approach
Taking frame sequences, $[I_i^1, I_i^2, ..., I_i^T]$, as inputs, the model first generates a set of temporal proposals, $\{p_i^1, p_i^2, ..., p_i^n\}$, where $p_i^j$ consists of temporally continuous image frames. Then, the method aligns the proposals with the input query and outputs scores for the proposals, $\{s_i^1, s_i^2, ..., s_i^n\}$, indicating their likelihood of containing the event.
Feature Description. Given a sentence query $Q_i$ of arbitrary length, sentence encoders can be used to extract a text feature, $f_{q_i}$, from the query. For a video, visual features are extracted from each frame. Following (Hendricks et al., 2017), the visual feature, $f_{p_i^j}$, of a proposal $p_i^j$ is obtained using Eq. 1, where $pool(x, t_1, t_2)$ means average pooling the features $x$ from time $t_1$ to $t_2$, $\|$ indicates concatenation, $j_{st}$/$j_{ed}$ indicate the start/end times of the proposal, and $\bar{j}$ means that the time is normalized to [0, 1].
We see that the feature of each proposal contains the information of its visual pattern, the overall context and its relative position in the video.
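The construction described above (local pooling, global context pooling, normalized position) can be sketched in NumPy. This is only an illustration of the description, not the authors' code: the names `pool` and `proposal_feature`, the use of frame indices as times, and the half-open interval `[t1, t2)` are assumptions made for the example.

```python
import numpy as np

def pool(x, t1, t2):
    """Average-pool frame features x (shape [T, D]) over frames t1 .. t2-1.
    The half-open interval is an assumption of this sketch."""
    return x[t1:t2].mean(axis=0)

def proposal_feature(frame_feats, j_st, j_ed):
    """Concatenate local pattern, whole-video context, and normalized position."""
    T = frame_feats.shape[0]
    local = pool(frame_feats, j_st, j_ed)      # visual pattern of the proposal
    context = pool(frame_feats, 0, T)          # overall video context
    position = np.array([j_st / T, j_ed / T])  # start/end normalized to [0, 1]
    return np.concatenate([local, context, position])

# Toy video: 6 frames with 2-dimensional features.
frame_feats = np.arange(12, dtype=float).reshape(6, 2)
feat = proposal_feature(frame_feats, 2, 4)
print(feat.shape)  # (6,) = 2 (local) + 2 (context) + 2 (position)
```

With D-dimensional frame features, the resulting proposal feature has 2D + 2 entries under these assumptions.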
Following (Gao et al., 2017a), the features of the sentence and a visual proposal are combined as in Eq. 2. The resulting feature, $f_m$, is used to measure the matching between a candidate proposal and the input query.
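Since Eq. 2 is not reproduced in this text, the sketch below only illustrates one plausible multi-modal combination in the spirit of (Gao et al., 2017a): element-wise addition, element-wise multiplication, and a fully connected layer over the concatenated features. The exact form, the function name `fuse`, and the parameters `W_fc` and `b_fc` are assumptions, not the paper's definition.

```python
import numpy as np

def fuse(f_p, f_q, W_fc, b_fc):
    """Combine a proposal feature f_p and a sentence feature f_q (both length d).
    ASSUMPTION: add || multiply || FC(concat), which may differ from Eq. 2."""
    added = f_p + f_q
    multiplied = f_p * f_q
    fc = np.concatenate([f_p, f_q]) @ W_fc + b_fc  # linear layer: 2d -> d
    return np.concatenate([added, multiplied, fc])

rng = np.random.default_rng(0)
d = 4
f_p = rng.normal(size=d)
f_q = rng.normal(size=d)
W_fc = rng.normal(size=(2 * d, d))  # hypothetical FC weights
b_fc = np.zeros(d)
f_m = fuse(f_p, f_q, W_fc, b_fc)
print(f_m.shape)  # (12,), i.e. 3 * d
```

Both inputs must already share the same dimension d, which matches the linear transformation of visual and sentence features mentioned in the implementation details.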
The workflow of WSLLN is illustrated in Fig. 1. Inspired by the success of the two-stream structure in weakly supervised object and action detection tasks (Bilen and Vedaldi, 2016; Wang et al., 2017), WSLLN consists of two branches, i.e., an alignment branch and a selection branch, the latter referred to below as the detection branch. The semantic consistency between the input text and each visual proposal is measured in the alignment branch. The proposals are compared and selected in the detection branch. Scores from both branches are merged to produce the final results.
Alignment Branch produces the consistency scores, $sa_i = [sa_i^1, sa_i^2, ..., sa_i^n] \in \mathbb{R}^{n \times 2}$, for the proposals of the video-sentence pair. $sa_i$, given in Eq. 3, measures how well each proposal matches the text.
The scores of different proposals are calculated independently, where $\mathrm{softmax}_a$ indicates applying the softmax function over the last dimension.
Detection Branch performs proposal selection. The selection score, $sd_i = [sd_i^1, sd_i^2, ..., sd_i^n] \in \mathbb{R}^{n \times 2}$ in Eq. 4, is obtained by applying the softmax function over proposals. Through this softmax, the score of a proposal is affected by those of the other proposals, so this operation encourages competition among segments.
Score Merging combines the outputs of both branches by element-wise multiplication, i.e., $s_i = sa_i \cdot sd_i$, to obtain the results for the proposals. $s_i$ is used as the final segment-sentence matching score during inference.
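The difference between the two branches comes down to the axis over which softmax is taken. The NumPy sketch below assumes each branch has already produced raw scores of shape [n, 2]; the layers that compute those scores are omitted, and the names `softmax` and `wslln_scores` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def wslln_scores(align_logits, detect_logits):
    """align_logits, detect_logits: raw branch outputs of shape [n, 2]."""
    sa = softmax(align_logits, axis=1)   # alignment: each proposal scored independently
    sd = softmax(detect_logits, axis=0)  # detection: proposals compete with each other
    s = sa * sd                          # score merging by element-wise product
    return sa, sd, s

rng = np.random.default_rng(0)
n = 5
sa, sd, s = wslln_scores(rng.normal(size=(n, 2)), rng.normal(size=(n, 2)))
print(sa.sum(axis=1))  # each row sums to 1
print(sd.sum(axis=0))  # each column sums to 1
```

Because each column of the detection scores sums to one over proposals, raising one proposal's selection score necessarily lowers the others, which is the competition effect described above.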
Training Phase. To utilize video-sentence pairs as supervision, our model is optimized as a video-sentence matching classifier. We compute the matching score of a given video-sentence pair by summing $s_i^j$ over proposals, $vq_i = \sum_{j=1}^{n} s_i^j$. Then, $L_v$ is obtained in Eq. 5 by comparing this score with the video-sentence match label $l_i \in \{0, 1\}$. Positive video-sentence pairs can be obtained directly. We generate negative ones by pairing each video with a randomly selected sentence in the training set. We ensure that the positive pairs are not included in the negative set.
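A minimal sketch of the video-level supervision follows. The text only states that a loss is computed between $vq_i$ and $l_i$, so two choices here are assumptions: the loss is a binary cross-entropy on the "match" entry `vq[1]`, and negative sampling assumes one positive sentence per video, indexed identically. The function names are hypothetical.

```python
import numpy as np

def video_level_loss(s, label, eps=1e-8):
    """s: merged proposal scores [n, 2]; label: 1 for a matched pair, 0 otherwise.
    ASSUMPTION: binary cross-entropy on the match entry of vq."""
    vq = s.sum(axis=0)  # video-sentence score, shape [2]
    p_match = np.clip(vq[1], eps, 1 - eps)
    loss = -(label * np.log(p_match) + (1 - label) * np.log(1 - p_match))
    return vq, loss

def sample_negative_sentence(video_idx, num_pairs, rng):
    """Pick a random sentence index that differs from the video's own sentence.
    ASSUMPTION: sentence i is the only positive for video i."""
    j = int(rng.integers(num_pairs - 1))
    return j if j < video_idx else j + 1

s = np.array([[0.05, 0.60], [0.10, 0.20], [0.05, 0.10]])
vq, loss_pos = video_level_loss(s, 1)
_, loss_neg = video_level_loss(s, 0)
print(vq, loss_pos < loss_neg)

rng = np.random.default_rng(0)
negatives = [sample_negative_sentence(i, 4, rng) for i in range(4)]
```

In this toy example the summed match score is high, so the loss is small when the pair is labeled positive and large when it is labeled negative.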
Results can be further refined by adding an auxiliary task, $L_r$ in Eq. 6, where $\hat{y}_i \in \{0, 1, ..., n-1\}$ indicates the index of the segment that best matches the sentence during training.
The real segment-level labels are not available; thus, we generate pseudo labels by setting $\hat{y}_i = \arg\max_j s_i^j[:, 1]$. This loss further encourages competition among proposals.
The overall objective is to minimize $L$ in Eq. 7, where $\lambda$ is a balancing scalar and loss denotes the cross-entropy loss.
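The pseudo-label refinement and the combined objective can be sketched as below. This assumes that the cross-entropy for $L_r$ is taken over the n proposals with the match scores `s[:, 1]` used as logits, and that the objective has the form $L = L_v + \lambda L_r$; both are assumptions, since Eqs. 6 and 7 are not reproduced here. The default `lam=0.1` is purely illustrative.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over a 1-D array."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def refinement_loss(s):
    """s: merged proposal scores [n, 2]. Returns the pseudo label and L_r.
    ASSUMPTION: cross-entropy over proposals with s[:, 1] as logits."""
    match_scores = s[:, 1]
    y_hat = int(np.argmax(match_scores))  # pseudo label: current best proposal
    loss_r = -log_softmax(match_scores)[y_hat]
    return y_hat, loss_r

def total_loss(loss_v, loss_r, lam=0.1):
    """ASSUMED form of the overall objective: L = L_v + lambda * L_r."""
    return loss_v + lam * loss_r

s = np.array([[0.05, 0.60], [0.10, 0.20], [0.05, 0.10]])
y_hat, loss_r = refinement_loss(s)
print(y_hat, total_loss(1.0, loss_r))
```

In a framework with automatic differentiation, the pseudo label would be treated as a constant target, so gradients flow only through the scores.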

Experimental Settings
Implementation Details. BERT (Devlin et al., 2018) is used as the sentence encoder, where the feature of '[CLS]' at the last layer is extracted as the sentence representation. Visual and sentence features are linearly transformed to have the same dimension, d = 1000. The hidden layers for both branches have 256 units. For ActivityNet Captions, we take the n = 15 proposals over multiple scales of each video provided by (Duan et al., 2018) and use the C3D (Tran et al., 2015) features provided by (Krishna et al., 2017). For DiDeMo, we use the n = 21 proposals and VGG (Simonyan and Zisserman, 2014) features (RGB and Flow) provided in (Hendricks et al., 2017).
Evaluation Metrics. Following (Gao et al., 2017a; Hendricks et al., 2017), R@k, IoU=th and mIoU are used for evaluation. Proposals are ranked according to their matching scores with the input sentence. If the temporal IoU between at least one of the top-k proposals and the groundtruth is greater than or equal to th, the sentence is counted as matched. R@k, IoU=th is the percentage of matched sentences over all sentences for the given k and th. mIoU is the mean IoU between the top-1 proposal and the groundtruth. Although the datasets provide segment-level annotations, we only use video-sentence pairs during training.
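The metrics defined above can be written compactly. The sketch assumes each proposal and groundtruth is a `(start, end)` tuple and that each query has exactly one groundtruth segment; the function names are hypothetical.

```python
def temporal_iou(a, b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_proposals, groundtruths, k, th):
    """Percentage of queries with at least one top-k proposal reaching IoU >= th."""
    matched = sum(
        any(temporal_iou(p, gt) >= th for p in props[:k])
        for props, gt in zip(ranked_proposals, groundtruths)
    )
    return 100.0 * matched / len(groundtruths)

def mean_iou(ranked_proposals, groundtruths):
    """Mean IoU (as a percentage) between the top-1 proposal and the groundtruth."""
    ious = [temporal_iou(props[0], gt)
            for props, gt in zip(ranked_proposals, groundtruths)]
    return 100.0 * sum(ious) / len(ious)

# Two queries, each with proposals already sorted by matching score.
ranked = [[(0, 10), (5, 15)], [(20, 30), (0, 5)]]
gts = [(0, 8), (2, 6)]
print(recall_at_k(ranked, gts, k=1, th=0.5))  # first query matches, second does not
print(recall_at_k(ranked, gts, k=2, th=0.5))  # second query matches via its 2nd proposal
print(mean_iou(ranked, gts))
```

The second query shows why R@k grows with k: its top-1 proposal misses the groundtruth entirely, while its second-ranked proposal reaches an IoU of exactly 0.5.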

Experiments on ActivityNet Captions
Baselines. We compare with strongly supervised approaches, i.e., CTRL (Gao et al., 2017a), ABLR (Yuan et al., 2018) and WSDEC-S (Duan et al., 2018), to see how much accuracy our method sacrifices when using only weak labels. Originally proposed for dense captioning, WSDEC-W (Duan et al., 2018) achieves state-of-the-art performance for weakly supervised language localization. Although it shows good performance, WSDEC-W involves complicated training stages and alternates between sentence localization and caption generation over multiple iterations.

Comparison Results
Comparison results are displayed in Tab. 1. WSLLN outperforms WSDEC-W by ∼4% mIoU. Compared with strongly supervised methods, WSLLN outperforms CTRL by over 11% mIoU. Under the R@1, IoU=0.1 metric, our model outperforms all the baselines by a large margin, including both strongly and weakly supervised methods, which means that when a scenario is flexible about IoU coverage, our method has a clear advantage over the others. When th = 0.3/0.5, our model has results comparable to those of WSDEC-W and outperforms CTRL by a large margin. The overall results demonstrate the good performance of WSLLN, even though there is still a big gap between weakly supervised methods and some supervised ones, i.e., ABLR and WSDEC-S. The mIoU (mean ± std) of WSLLN across 3 runs is 32.2 ± 0.05, which indicates that the method is stable across runs.

Ablation Study
Effect of λ. We evaluate the effect of λ (see Eq. 7) in Tab. 2. As it shows, our model is stable when λ is set from 0.1 to 0.4. When λ = 0, the refining module is disabled and the performance drops. When λ is set to a larger value, e.g., 0.5, the contribution of $L_v$ is reduced and the model performance also drops.
Effect of Sentence Encoder. WSDEC-W uses GRU (Cho et al., 2014) as its sentence encoder, while our method uses BERT. This may seem an unfair comparison, since BERT is more powerful than GRU in general. However, we use a pretrained BERT model without fine-tuning on our dataset, while WSDEC-W uses GRU but performs end-to-end training. So, it is unclear which setting is better. To resolve this concern, we replace our BERT with GRU following WSDEC-W. The R@1 results when IoU is set to 0.1, 0.3 and 0.5 are 74.0, 42.3 and 22.5, respectively. The mIoU is 31.8. This shows that our model with GRU has results comparable to those with BERT.
Effect of Two-branch Design. We create two baselines, i.e., Align-only and Detect-only, to demonstrate the effectiveness of our design. For a fair comparison, both of them are trained using only video-sentence pairs. Align-only contains only the alignment branch. For a positive video-sentence pair, we give positive labels to all proposals. Negative pairs have negative labels for all the proposals. The loss is calculated between the proposal scores and the generated segment-level labels.
Detect-only contains only the detection branch. The loss is calculated using the highest detection score over proposals and the video-level label at each training iteration.
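The two baseline losses can be sketched as follows. The text does not specify the exact loss form, so this example assumes a binary cross-entropy on the "match" column of each branch's scores; the helper names `bce`, `align_only_loss`, and `detect_only_loss` are hypothetical.

```python
import numpy as np

def bce(p, label, eps=1e-8):
    """Binary cross-entropy between probability p and a 0/1 label."""
    p = np.clip(p, eps, 1 - eps)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def align_only_loss(sa, label):
    """Align-only: every proposal inherits the video-level label.
    sa: alignment scores [n, 2], softmax over the last dimension."""
    return float(np.mean(bce(sa[:, 1], label)))

def detect_only_loss(sd, label):
    """Detect-only: the highest detection score is compared with the video label.
    sd: detection scores [n, 2], softmax over proposals."""
    return float(bce(sd[:, 1].max(), label))

sa = np.array([[0.2, 0.8], [0.6, 0.4]])
sd = np.array([[0.5, 0.7], [0.5, 0.3]])
print(align_only_loss(sa, 1), detect_only_loss(sd, 1))
```

Under these assumptions, Align-only pushes every proposal of a positive pair toward "match", so it has no mechanism to separate good segments from bad ones, while Detect-only supervises only the single top-scoring proposal.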
Comparison results are displayed in Tab. 3. It shows that the two baselines underperform WSLLN by a large margin, which demonstrates the effectiveness of our design.

Experiments on DiDeMo
Dataset Description. DiDeMo was proposed in (Hendricks et al., 2017) for the language localization task. It contains 10k 30-second videos with 40k annotated segment-sentence pairs.

Model        IoU=0.1  IoU=0.3  IoU=0.5  mIoU
Align-only   40.0     18.9     7.5      13.4
Detect-only  33.7     18.3     10.4     13.6

Our models are trained using video-sentence pairs in the train set and tested on the test set.
Baselines. To the best of our knowledge, no weakly supervised method has been evaluated on DiDeMo. So, we compare with some supervised methods, i.e., MCN (Hendricks et al., 2017) and LOR (Hu et al., 2016). MCN is a supervised NLL model. LOR is a supervised language-object retrieval model. It utilizes much more expensive (object-level) annotations for training. We follow the same setup of LOR as in (Hendricks et al., 2017) to evaluate LOR for our task.
Comparison Results are shown in Tab. 4. WSLLN performs better than LOR in terms of R@1/5. We also observe that the gap between our method and the supervised NLL model is much larger on DiDeMo than on ActivityNet Captions. This may be because DiDeMo is a much smaller dataset, which is a disadvantage for weakly supervised learning.

Conclusion
We propose WSLLN, a simple language localization network. Unlike most existing methods, which require segment-level supervision, our method is optimized using video-sentence pairs. WSLLN is based on a two-branch architecture where one branch performs segment-sentence alignment and the other conducts segment selection. Experiments show that WSLLN achieves promising results on ActivityNet Captions and DiDeMo.

Figure 1: The workflow of our method. Visual and text features are extracted from n video proposals and the input sentence. Fully-connected (FC) layers are used to transform the features to the same length, d. The two features are combined by multi-modal processing (Gao et al., 2017a) and input to the two-branch structure. Scores from both parts are merged. Video-level scores, vq, are obtained by summing s over proposals. The whole pipeline is trained end-to-end using video-level and pseudo segment-level labels. x × z indicates dimensions.
ActivityNet Captions Dataset Description. ActivityNet Captions (Krishna et al., 2017) is a large-scale dataset of human activities. It contains 20k videos including 100k video-sentence pairs in total. We train our models on the training set and test them on the validation set.

Table 2: R@1 results of our method on ActivityNet Captions when λ in Eq. 7 is set to different values.

Table 3: Ablation study based on R@1 on ActivityNet Captions. Both methods are trained using weak supervision.