Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video

In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. First, a set of spatio-temporal tubes, referred to as instances, are extracted from the video. We then encode these instances and the sentence using our newly proposed attentive interactor which can exploit their fine-grained relationships to characterize their matching behaviors. Besides a ranking loss, a novel diversity loss is introduced to train our attentive interactor to strengthen the matching behaviors of reliable instance-sentence pairs and penalize the unreliable ones. We also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Results from extensive experiments demonstrate the superiority of our model over the baseline approaches.


Introduction
Given an image/video and a language query, image/video grounding aims to localize a spatial region in the image (Plummer et al., 2015; Yu et al., 2017, 2018) or a specific frame in the video (Zhou et al., 2018) which semantically corresponds to the language query. Grounding has broad applications, such as text-based image retrieval (Chen et al., 2017), description generation (Wang et al., 2018a; Rohrbach et al., 2017; Wang et al., 2018b), and question answering (Gao et al., 2018; Ma et al., 2016). Recently, promising progress has been made in image grounding (Yu et al., 2018; Chen et al., 2018c; Zhang et al., 2018), which heavily relies on fine-grained annotations in the form of region-sentence pairs. Fine-grained annotations for video grounding are more complicated and labor-intensive, as one may need to annotate a spatio-temporal tube (i.e., label the spatial region in each frame) in a video which semantically corresponds to one language query.

Figure 1: The proposed WSSTG task aims to localize a spatio-temporal tube (i.e., the sequence of green bounding boxes) in the video which semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. Example sentence: "A brown and white dog is lying on the grass and then it stands up."
To avoid the intensive labor involved in dense annotations, Huang et al. (2018) and Zhou et al. (2018) considered the problem of weakly-supervised video grounding, where only aligned video-sentence pairs are provided without any fine-grained regional annotations. However, they both ground only a noun or pronoun in a static frame of the video. As illustrated in Fig. 1, it is difficult to distinguish the target dog (denoted by the green box) from other dogs (denoted by the red boxes) if we attempt to ground only the noun "dog" in one single frame of the video. The main reason is that the textual description "dog" is not sufficiently expressive and the visual appearance in one single frame cannot characterize the spatio-temporal dynamics (e.g., the action and movements of the "dog").
In this paper, we introduce a novel task, referred to as weakly-supervised spatio-temporally grounding sentence in video (WSSTG). Specifically, given a natural sentence and a video, we aim to localize a spatio-temporal tube (i.e., a sequence of bounding boxes), referred to as an instance, in the video which semantically matches the given sentence (see Fig. 1). During training, we do not rely on any fine-grained regional annotations. Compared with existing weakly-supervised video grounding problems (Zhou et al., 2018; Huang et al., 2018), our proposed WSSTG task has the following two advantages and challenges. First, we aim to ground a natural sentence instead of just a noun or pronoun, which is more comprehensive and flexible. As illustrated in Fig. 1, with a detailed description like "lying on the grass and then it stands up", the target dog (denoted by green boxes) can be localized without ambiguity. However, how to comprehensively capture the semantic meaning of a sentence and ground it in a video, especially in a weakly-supervised manner, poses a challenge. Second, compared with one bounding box in a static frame, a spatio-temporal tube (denoted by a sequence of green bounding boxes in Fig. 1) presents the temporal movements of the "dog", which can characterize its visual dynamics and thereby semantically match the given sentence. However, how to exploit and model the spatio-temporal characteristics of the tubes as well as their complicated relationships with the sentence poses another challenge.
To handle the above challenges, we propose a novel model realized within the multiple instance learning framework (Karpathy and Fei-Fei, 2015; Tang et al., 2017, 2018). First, a set of instance proposals is extracted from a given video. Features of the instance proposals and the sentence are then encoded by a novel attentive interactor that exploits their fine-grained relationships to generate semantic matching behaviors. Finally, we propose a diversity loss, together with a ranking loss, to train the whole model. During testing, the instance proposal which exhibits the strongest semantic matching behavior with the given sentence is selected as the grounding result. To facilitate our proposed WSSTG task, we contribute a new grounding dataset, called VID-sentence, by providing sentence descriptions for the instances of the ImageNet video object detection dataset (VID) (Russakovsky et al., 2015). Specifically, 7,654 instances of 30 categories from 4,381 videos in VID are extracted. For each instance, annotators are asked to provide a natural sentence describing its content. Please refer to Sec. 4 for more details about the dataset.
Our main contributions can be summarized as follows. 1) We tackle a novel task, namely weakly-supervised spatio-temporally video grounding (WSSTG), which localizes a spatio-temporal tube in a given video that semantically corresponds to a given natural sentence, in a weakly-supervised manner. 2) We propose a novel attentive interactor to exploit fine-grained relationships between instances and the sentence to characterize their matching behaviors. A diversity loss is proposed to strengthen the matching behaviors between reliable instance-sentence pairs and penalize the unreliable ones during training. 3) We contribute a new dataset, named VID-sentence, to serve as a benchmark for the novel WSSTG task. 4) Extensive experimental results are analyzed, which illustrate the superiority of our proposed method.

Related Work
Grounding in Images/Videos. Grounding in images has been popular in the research community over the past decade (Kong et al., 2014; Matuszek et al., 2012; Wang et al., 2016a,b; Li et al., 2017; Cirik et al., 2018; Sadeghi and Farhadi, 2011; Zhang et al., 2017; Xiao et al., 2017; Chen et al., 2019, 2018a). In recent years, researchers have also explored grounding in videos. Yu and Siskind (2015) grounded objects in constrained videos by leveraging weak semantic constraints implied by a sequence of sentences. Vasudevan et al. (2018) grounded objects in the last frame of stereo videos with the help of text, motion cues, human gazes, and spatial-temporal context. However, fully supervised grounding requires intensive labor for regional annotations, especially in the case of videos.
Weakly-Supervised Grounding. To avoid the intensive labor involved in regional annotations, weakly-supervised grounding has been proposed, where only image-sentence or video-sentence pairs are needed. It was first studied in the image domain (Zhao et al., 2018). Later, given a sequence of transcriptions and their corresponding video clips as well as their temporal alignment, Huang et al. (2018) grounded nouns/pronouns in specific frames by constructing a visual grounded action graph. The work closest to ours is (Zhou et al., 2018), in which the authors grounded a noun in a specific frame by considering object interactions and loss weighting given one video and one text input. In this work, we also focus on grounding in a video-text pair. However, different from (Zhou et al., 2018), whose text input consists of nouns/pronouns and whose output is a bounding box in a specific frame, we aim to ground a natural sentence and output a spatio-temporal tube in the video.

Figure 2: The architecture of our model. An instance generator is used to produce spatio-temporal instances. An attentive interactor is proposed to exploit the complicated relationships between instances and the sentence. Multiple instance learning is used to train the model with a ranking loss and a diversity loss.

Method
Given a natural sentence query $q$ and a video $v$, our proposed WSSTG task aims to localize a spatio-temporal tube, referred to as an instance, $p = \{b_t\}_{t=1}^{T}$ in the video sequence, where $b_t$ represents a bounding box in the $t$-th frame and $T$ denotes the total number of frames. The localized instance should semantically correspond to the sentence query $q$. As WSSTG is carried out in a weakly-supervised manner, only aligned video-sentence pairs $\{v, q\}$ are available, with no fine-grained regional annotations during training. In this paper, we cast the WSSTG task as a multiple instance learning problem (Karpathy and Fei-Fei, 2015). Given a video $v$, we first generate a set of instance proposals by an instance generator (Gkioxari and Malik, 2015). We then identify which instance semantically matches the natural sentence query $q$.
We propose a novel model for handling the WSSTG task. It consists of two components, namely an instance generator and an attentive interactor (see Fig. 2). The instance generator links bounding boxes detected in each frame into instance proposals (see Sec. 3.1). The attentive interactor exploits the complicated relationships between instance proposals and the given sentence to yield their matching scores (see Sec. 3.2). The proposed model is optimized with a ranking loss $L_{rank}$ and a novel diversity loss $L_{div}$ (see Sec. 3.3). Specifically, $L_{rank}$ aims to distinguish aligned video-sentence pairs from the unaligned ones, while $L_{div}$ targets strengthening the matching behaviors between reliable instance-sentence pairs and penalizing the unreliable ones from the aligned video-sentence pairs.

Instance Extraction
Instance Generation. As shown in Fig. 2, the first step of our method is to generate instance proposals. Similar to (Zhou et al., 2018), the region proposal network from Faster-RCNN (Ren et al., 2015) is used to detect frame-level bounding boxes with corresponding confidence scores, which are then linked to produce spatio-temporal tubes.
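Let $b_t$ denote a detected bounding box at time $t$ and $b_{t+1}$ denote another box at time $t+1$. Following (Gkioxari and Malik, 2015), we define the linking score $s_l$ between $b_t$ and $b_{t+1}$ as
$$ s_l(b_t, b_{t+1}) = s_c(b_t) + s_c(b_{t+1}) + \lambda \cdot \mathrm{IoU}(b_t, b_{t+1}), \qquad (1) $$
where $s_c(b)$ is the confidence score of $b$, $\mathrm{IoU}(b_t, b_{t+1})$ is the intersection-over-union (IoU) of $b_t$ and $b_{t+1}$, and $\lambda$ is a balancing scalar which is set to 0.2 in our implementation.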
As such, one instance proposal $p_n$ can be viewed as a path $\{b^n_t\}_{t=1}^{T}$ over the whole video sequence with energy $E(p_n)$ given by
$$ E(p_n) = \frac{1}{T-1} \sum_{t=1}^{T-1} s_l(b^n_t, b^n_{t+1}). \qquad (2) $$
We identify the instance proposal with the maximal energy by the Viterbi algorithm (Gkioxari and Malik, 2015). We keep the identified instance proposal and remove all the bounding boxes associated with it. We then repeat the above process until there is no bounding box left. This results in a set of instance proposals $P = \{p_n\}_{n=1}^{N}$, with $N$ being the total number of proposals.
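The linking procedure above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`iou`, `best_path`, `link_tubes`) are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, every frame is assumed to hold the same number of boxes, and the linking score is assumed to be the sum of the two detection confidences plus λ times the IoU, in the style of Gkioxari and Malik (2015). Since the path length is fixed, maximizing the summed linking score selects the same path as maximizing any length-normalized energy.

```python
import numpy as np

LAMBDA = 0.2  # balancing scalar, value taken from the text


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_path(boxes, scores):
    """Viterbi search for the path (one box index per frame) with maximal summed linking score."""
    acc = np.zeros(len(boxes[0]))  # best accumulated score of a path ending at each box
    back = []                      # back-pointers, one array per transition
    for t in range(1, len(boxes)):
        link = np.array([[scores[t - 1][i] + scores[t][j]
                          + LAMBDA * iou(boxes[t - 1][i], boxes[t][j])
                          for j in range(len(boxes[t]))]
                         for i in range(len(boxes[t - 1]))])
        total = acc[:, None] + link
        back.append(total.argmax(axis=0))
        acc = total.max(axis=0)
    path = [int(acc.argmax())]
    for bp in reversed(back):      # follow back-pointers from the last frame to the first
        path.append(int(bp[path[-1]]))
    return path[::-1]


def link_tubes(boxes, scores):
    """Greedy extraction: take the best path, remove its boxes, repeat until a frame is empty."""
    boxes = [list(b) for b in boxes]
    scores = [list(s) for s in scores]
    tubes = []
    while all(len(b) > 0 for b in boxes):
        path = best_path(boxes, scores)
        tubes.append([boxes[t][i] for t, i in enumerate(path)])
        for t, i in enumerate(path):
            del boxes[t][i]
            del scores[t][i]
    return tubes
```

With 30 boxes per frame, as in the implementation details, this loop would yield 30 tubes per video.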
Feature Representation. Since an instance proposal consists of bounding boxes in consecutive video frames, we use I3D (Carreira and Zisserman, 2017) and Faster-RCNN to generate the RGB sequence feature I3D-RGB, the flow sequence feature I3D-Flow, and the frame-level RoI pooled feature, respectively. Note that it is not effective to encode each bounding box, as an instance proposal may include thousands of bounding boxes. We therefore evenly divide each instance proposal into $t_p$ segments and average the features within each segment. $t_p$ is set to 20 for all our experiments. We concatenate all three kinds of visual features before feeding them into the following attentive interactor. Taking each segment as a time step, each proposal $p$ is thereby represented as $F_p \in \mathbb{R}^{t_p \times d_p}$, a sequence of $d_p$-dimensional concatenated visual features at each step.

Figure 3: The attentive interactor. It consists of two components, namely interaction and matching behavior characterization. A denotes the attention mechanism in Eqs. (4-6). φ denotes the function in Eq. (7).
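The segment averaging step can be sketched as follows. This is an illustration under assumptions: the function name `segment_average` is hypothetical, per-frame features are assumed to be already concatenated into a `(T, d_p)` array, and the handling of tubes whose length is not divisible by the number of segments (near-equal chunks via `np.array_split`) is our choice, since the text does not specify it.

```python
import numpy as np


def segment_average(frame_feats, num_segments=20):
    """Split per-frame features of shape (T, d) into near-equal segments and average each.

    Returns an array of shape (num_segments, d).
    """
    frame_feats = np.asarray(frame_feats, dtype=float)
    if frame_feats.shape[0] < num_segments:
        raise ValueError("tube is shorter than the number of segments")
    # np.array_split tolerates T not being divisible by num_segments
    chunks = np.array_split(frame_feats, num_segments, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])
```

For a tube of 1,000 frames, this reduces the sequence the LSTM has to process from 1,000 steps to 20.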

Attentive Interactor
With the instance proposals from the video and the given sentence query, we propose a novel attentive interactor to characterize the matching behaviors between each proposal and the sentence query. Our attentive interactor consists of two coupled components, namely interaction and matching behavior characterization (see Fig. 3). Before diving into the details of the interactor, we first introduce the representation of the query sentence $q$. We represent each word in $q$ using the 300-dimensional word2vec embedding (Mikolov et al., 2013) and omit words that are not in the dictionary. In this way, each sentence $q$ is represented as $F_q \in \mathbb{R}^{t_q \times d_q}$, where $t_q$ is the total number of words in the sentence and $d_q$ denotes the dimension of the word embedding.

Interaction
Given the sequential visual features $F_p \in \mathbb{R}^{t_p \times d_p}$ of one candidate instance and the sequential textual features $F_q \in \mathbb{R}^{t_q \times d_q}$ of the query sentence, we propose an interaction module to exploit their complicated matching behaviors in a fine-grained manner. First, two long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) are utilized to encode the instance proposal and sentence, respectively:
$$ h^p_t = \mathrm{LSTM}_p(f^p_t, h^p_{t-1}), \quad h^q_t = \mathrm{LSTM}_q(f^q_t, h^q_{t-1}), \qquad (3) $$
where $f^p_t$ and $f^q_t$ are the $t$-th row representations in $F_p$ and $F_q$, respectively. Due to the natural characteristics of LSTM, $h^p_t$ and $h^q_t$, as the yielded hidden states, encode and aggregate the contextual information from the sequential representation, and thereby yield more meaningful and informative visual features $H_p = \{h^p_t\}_{t=1}^{t_p}$ and sentence representations $H_q = \{h^q_t\}_{t=1}^{t_q}$. Different from (Zhao et al., 2018), which used only the last hidden state $h^q_{t_q}$ as the feature embedding for the query sentence, we generate visually guided sentence features $H_{qp} = \{h^{qp}_t\}_{t=1}^{t_p}$ by exploiting their fine-grained relationships based on $H_q$ and $H_p$. Specifically, given the $i$-th visual feature $h^p_i$, an attention mechanism (Xu et al., 2015) is used to adaptively summarize $H_q = \{h^q_t\}_{t=1}^{t_q}$ with respect to $h^p_i$:
$$ e_{i,j} = w^\top \tanh(W_q h^q_j + W_p h^p_i + b_1) + b_2, \qquad (4) $$
$$ a_{i,j} = \frac{\exp(e_{i,j})}{\sum_{j'=1}^{t_q} \exp(e_{i,j'})}, \qquad (5) $$
$$ h^{qp}_i = \sum_{j=1}^{t_q} a_{i,j} h^q_j, \qquad (6) $$
where $W_q \in \mathbb{R}^{K \times D_q}$, $W_p \in \mathbb{R}^{K \times D_p}$, and $b_1 \in \mathbb{R}^{K}$ are the learnable parameters that map visual and sentence features to the same $K$-dimensional space. $w \in \mathbb{R}^{K}$ and $b_2 \in \mathbb{R}$ work on the coupled textual and visual features and yield their affinity scores. With respect to $W_p h^p_i$ in Eq. (4), the generated visually guided sentence feature $h^{qp}_i$ pays more attention to the words more correlated with $h^p_i$ by adaptively summarizing $H_q = \{h^q_t\}_{t=1}^{t_q}$. Owing to the attention mechanism in Eqs. (4-6), our proposed interaction module makes each visual feature interact with all the sentence features and attentively summarize them together. As such, fine-grained relationships between the visual and sentence representations are exploited.
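A minimal NumPy sketch of the attention step may help make the shapes concrete. It assumes the standard additive attention form (a tanh over the summed projections, a softmax over the words, then a weighted sum of the word states); the function name and argument order are hypothetical, and the LSTM encoders are taken as given, so `Hp` and `Hq` stand for already computed hidden states.

```python
import numpy as np


def visually_guided_sentence_features(Hp, Hq, Wp, Wq, b1, w, b2):
    """Additive attention of every visual step over all words.

    Hp: (t_p, D_p) visual hidden states, Hq: (t_q, D_q) word hidden states.
    Wp: (K, D_p), Wq: (K, D_q), b1: (K,), w: (K,), b2: scalar.
    Returns (H_qp, A) with H_qp: (t_p, D_q) and attention weights A: (t_p, t_q).
    """
    proj_p = Hp @ Wp.T  # (t_p, K)
    proj_q = Hq @ Wq.T  # (t_q, K)
    # e[i, j]: affinity between visual step i and word j
    e = np.tanh(proj_p[:, None, :] + proj_q[None, :, :] + b1) @ w + b2
    e = e - e.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    A = np.exp(e)
    A /= A.sum(axis=1, keepdims=True)     # softmax over words, per visual step
    return A @ Hq, A                      # each row is a weighted sum of word states
```

Each row of `A` is a distribution over the words of the sentence, which is the quantity one would plot to inspect which words a given video segment attends to.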

Matching Behavior Characterization
After obtaining a set of visually guided sentence features $H_{qp} = \{h^{qp}_t\}_{t=1}^{t_p}$, we characterize the fine-grained matching behaviors between the visual and sentence features. Specifically, the matching behavior between the $i$-th visual and sentence features is defined as
$$ m(h^p_i, h^{qp}_i) = \phi(h^p_i, h^{qp}_i). \qquad (7) $$
The instantiation of $\phi$ can be realized by different approaches, such as a multi-layer perceptron (MLP), inner product, or cosine similarity. In this paper, we use cosine similarity between $h^p_i$ and $h^{qp}_i$ for simplicity. Finally, we define the matching behavior between an instance proposal $p$ and the sentence $q$ as
$$ s(q, p) = \frac{1}{t_p} \sum_{i=1}^{t_p} m(h^p_i, h^{qp}_i). \qquad (8) $$
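As a small illustration of the matching step, the sketch below computes the per-step cosine similarity and aggregates it into one proposal-level score. The function name is hypothetical, the two feature sequences are assumed to share the same dimension, and aggregating by the mean over the $t_p$ steps is an assumption of this sketch.

```python
import numpy as np


def matching_score(Hp, Hqp, eps=1e-8):
    """Cosine similarity between aligned rows of Hp and Hqp, averaged over steps.

    Hp, Hqp: arrays of shape (t_p, D). Returns (score, per_step_similarities).
    """
    Hp = np.asarray(Hp, dtype=float)
    Hqp = np.asarray(Hqp, dtype=float)
    num = (Hp * Hqp).sum(axis=1)
    den = np.linalg.norm(Hp, axis=1) * np.linalg.norm(Hqp, axis=1) + eps  # eps avoids 0/0
    m = num / den
    return float(m.mean()), m
```

Because cosine similarity is bounded, the resulting score lies in [-1, 1], which is worth keeping in mind when choosing the margin of a ranking loss on top of it.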

Training
For the WSSTG task, since no regional annotations are available during training, we cannot optimize the framework in a fully supervised manner. We therefore resort to multiple instance learning (MIL) to optimize the proposed network based on the obtained matching behaviors of the instance-sentence pairs. Specifically, our objective function is defined as
$$ L = L_{rank} + \beta L_{div}, \qquad (9) $$
where $L_{rank}$ is a ranking loss, aiming at distinguishing aligned video-sentence pairs from the unaligned ones. $L_{div}$ is a novel diversity loss, which is proposed to strengthen the matching behaviors between reliable instance-sentence pairs and penalize the unreliable ones from the aligned video-sentence pair. $\beta$ is a scalar which is set to 1 in all our experiments.
Ranking Loss. Assume that $\{v, q\}$ is a semantically aligned video-sentence pair. We define the visual-semantic matching score $S$ between $v$ and $q$ as
$$ S(v, q) = \max_{n = 1, \ldots, N} s(q, p_n), \qquad (10) $$
where $p_n$ is the $n$-th proposal generated from the video $v$, $s(q, p_n)$ is the matching behavior computed by Eq. (8), and $N$ is the total number of instance proposals.
Suppose that $v'$ and $q'$ are negative samples that are not semantically correlated with $q$ and $v$, respectively. Inspired by (Karpathy and Fei-Fei, 2015), we define the ranking loss as
$$ L_{rank} = \max(0, S(v, q') - S(v, q) + \Delta) + \max(0, S(v', q) - S(v, q) + \Delta), \qquad (11) $$
where $\Delta$ is a margin which is set to 1 in all our experiments. $L_{rank}$ directly encourages the matching scores of aligned video-sentence pairs to be larger than those of unaligned pairs.
Diversity Loss. One limitation of the ranking loss defined in Eq. (11) is that it does not consider the matching behaviors between the sentence and different instance proposals extracted from an aligned video. A prior for video grounding is that only a few instance proposals in the paired video are semantically aligned to the query sentence, while most of the other instance proposals are not. Thus, it is desirable to have a diverse distribution of the matching behaviors $\{s(q, p_n)\}_{n=1}^{N}$. To encourage such a distribution, we propose a diversity loss $L_{div}$ to strengthen the matching behaviors between reliable instance-sentence pairs and penalize the unreliable ones during training. Specifically, we first normalize $\{s(q, p_n)\}_{n=1}^{N}$ by softmax,
$$ s'(q, p_n) = \frac{\exp(s(q, p_n))}{\sum_{n'=1}^{N} \exp(s(q, p_{n'}))}, \qquad (12) $$
and then penalize the entropy of the distribution of $\{s'(q, p_n)\}_{n=1}^{N}$ by defining the diversity loss as
$$ L_{div} = -\sum_{n=1}^{N} s'(q, p_n) \log(s'(q, p_n)). \qquad (13) $$
Note that the smaller $L_{div}$ is, the more diverse $\{s(q, p_n)\}_{n=1}^{N}$ will be, which implicitly encourages the matching scores of semantically aligned instance-sentence pairs to be larger than those of the misaligned pairs.
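The two training losses can be illustrated with a short sketch. It is written under assumptions rather than taken from the authors' code: the function names are hypothetical, the ranking loss is assumed to be a two-sided hinge with margin Δ in the style of Karpathy and Fei-Fei (2015), and the diversity loss is taken to be the entropy of the softmax-normalized proposal scores, as described in the text.

```python
import numpy as np


def video_sentence_score(proposal_scores):
    """Video-level matching score: the maximum over the proposal-level scores."""
    return float(np.max(proposal_scores))


def ranking_loss(s_pos, s_neg_sentence, s_neg_video, margin=1.0):
    """Hinge loss pushing the aligned pair above both unaligned pairs by `margin`.

    s_pos: score of (v, q); s_neg_sentence: score of (v, q'); s_neg_video: score of (v', q).
    """
    return (max(0.0, s_neg_sentence - s_pos + margin)
            + max(0.0, s_neg_video - s_pos + margin))


def diversity_loss(proposal_scores):
    """Entropy of the softmax over the proposal scores of one aligned video."""
    s = np.asarray(proposal_scores, dtype=float)
    s = s - s.max()                  # stabilize the softmax
    p = np.exp(s)
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())  # small constant guards log(0)
```

A flat score distribution over N proposals gives the maximal entropy log N, while a distribution peaked on one proposal drives the loss toward zero, which is the behavior the diversity loss rewards.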

Inference
Given a testing video and a query sentence, we extract candidate instance proposals, and characterize the matching behavior between each instance proposal and the sentence by the proposed attentive interactor. The instance with the strongest matching behavior is deemed the result of the WSSTG task.
Figure 4: Samples of the newly constructed VID-sentence dataset. Sentences are shown on the top of images and the associated target instances are enclosed with green bounding boxes. Example sentences: "A red bus is making a turn on the road"; "A brown and white dog is lying on the grass and then standing up"; "A large elephant runs in the water from left to right".

VID-sentence Dataset
A main challenge for the WSSTG task is the lack of suitable datasets. Existing datasets like TACoS (Regneri et al., 2013) and YouCook (Das et al., 2013) are unsuitable as they do not provide spatio-temporal annotations for target instances in the videos, which are necessary for evaluating the WSSTG task. To the best of our knowledge, the most suitable existing dataset is the Person-sentence dataset provided by (Yamaguchi et al., 2017), which is used for spatio-temporal person search among videos. However, this dataset is too simple for the WSSTG task since it contains only people in the videos. To this end, we contribute a new dataset by annotating videos in the ImageNet video object detection dataset (VID) (Russakovsky et al., 2015) with sentence descriptions. We choose VID as the visual material for two primary reasons. First, it is one of the largest video detection datasets, containing videos of diverse categories in complicated scenarios. Second, it provides dense bounding-box annotations and instance IDs, which help avoid labor-intensive annotations for spatio-temporal regions of the validation/testing set.
VID-sentence Annotation. With 30 categories, VID contains 3826, 555 and 937 videos for training, validation and testing, respectively. We first divide videos in the training and validation sets into trimmed videos based on the provided instance IDs, and delete videos shorter than 9 frames (the testing set is omitted as its spatio-temporal annotations are unavailable). As such, there remain 9,029 trimmed videos in total. In each trimmed video, one instance is identified as a sequence of bounding boxes. A group of annotators is asked to provide sentence descriptions for the target instances. Each target instance is annotated with one sentence description. An instance is discarded if it is too difficult to provide a unique and precise description. After annotation, there are 7,654 videos with sentence descriptions.
We randomly select 6,582 videos as the training set, and evenly split the remaining 1,072 videos into the validation and testing sets (i.e., each contains 536 videos). Some examples from the VID-sentence dataset are shown in Fig. 4.
Dataset Statistics. The dataset covers all 30 categories in VID, such as "car", "monkey" and "watercraft". The size of the vocabulary is 1,823 and the average length of the descriptions is 13.2. Table 1 shows the statistics of our constructed VID-sentence dataset. Compared with the Person-sentence dataset, our VID-sentence dataset has a similar description length but includes more instances and categories.
It is important to note that, although VID provides regional annotations for the training set, these annotations are not used in any of our experiments since we focus on weakly-supervised spatio-temporal video grounding.

Experiments
In this section, we first compare our method with different kinds of baseline methods on the created VID-sentence dataset, followed by the ablation study. Finally, we show how well our model generalizes on the Person-sentence dataset.

Experimental Settings
Baseline Models. Existing weakly-supervised video grounding methods (Huang et al., 2018; Zhou et al., 2018) are not applicable to the WSSTG task. Huang et al. (2018) requires temporal alignment between a sequence of transcription descriptions and the video segments to ground a noun/pronoun in a certain frame, while Zhou et al. (2018) mainly grounds nouns/pronouns in specific frames of videos. As such, we develop three baselines based on DVSA (Karpathy and Fei-Fei, 2015), GroundeR, and a variant frame-level method modified from (Zhou et al., 2018) for performance comparisons. Following recent grounding methods like (Chen et al., 2018b), we use the last hidden state of an LSTM encoder as the sentence embedding for all the baselines. Since DVSA and GroundeR were originally proposed for image grounding, in order to adapt them to video, we consider three methods to encode the visual features $F_p \in \mathbb{R}^{t_p \times d_p}$: averaging (Avg), NetVLAD (Arandjelovic et al., 2016), and LSTM. For the variant baseline modified from (Zhou et al., 2018), we densely predict each frame to generate a spatio-temporal prediction.
Implementation Details. Similar to (Zhou et al., 2018), we use the region proposal network from Faster-RCNN pretrained on MSCOCO to extract frame-level region proposals. For each video, we extract 30 bounding boxes for each frame and link them into 30 spatio-temporal tubes with the method of (Gkioxari and Malik, 2015). We map the word embedding to 512 dimensions before feeding it to the LSTM encoder. The dimension of the hidden state of all the LSTMs is set to 512. The batch size is 16, i.e., 16 videos with 480 instance proposals in total and 16 corresponding sentences. We construct positive and negative video-sentence pairs for training within a batch for efficiency, i.e., roughly 16 positive pairs and 240 negative pairs (each of the 16 sentences paired with the 15 other videos) for the triplet construction. SGD is used to optimize the models with a learning rate of 0.001 and momentum of 0.9. We train all the models for 30 epochs.
Please refer to supplementary materials for more details.
Evaluation Metric. We use the bounding box localization accuracy for evaluation. An output instance is considered "accurate" if the overlap between the detected instance and the ground-truth is greater than a threshold η. The definition of the overlap is the same as in (Yamaguchi et al., 2017), i.e., the average overlap of the bounding boxes in annotated frames. η is set to 0.4, 0.5, and 0.6 for extensive evaluations.

Performance Comparisons
Table 2 shows the performance comparisons between our model and the baselines. We additionally show the performance of randomly choosing an instance proposal and the upper bound performance of choosing the instance proposal with the largest overlap with the ground-truth. The results suggest the following. 1) Models with NetVLAD (Arandjelovic et al., 2016) perform the worst. We suspect that models based on NetVLAD are complicated and the supervision is too weak to optimize them sufficiently well. 2) Models with LSTM embedding achieve only comparable performance to models based on simple averaging. This is mainly because the power of LSTM has not been fully exploited. 3) The variant method of (Zhou et al., 2018) performs better than both DVSA and GroundeR with various kinds of visual encoding techniques, indicating its power for the task. 4) Our model achieves the best results, showing that it is better at characterizing the matching behaviors between the query sentence and the visual instances in the video.
To compare the methods qualitatively, we show an exemplar sample in Fig. 5. Compared with GroundeR+LSTM and DVSA+LSTM, our method identifies a more accurate instance from the candidate instance proposals. Moreover, the instances generated by our method are more temporally consistent compared with the modified frame-level method (Zhou et al., 2018). This can be attributed to the exploitation of temporal information during instance generation and in the attentive interactor of our model.
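For reference, the localization accuracy used in these comparisons (average box overlap over annotated frames, thresholded at η) can be sketched as below. The data layout is an assumption of this sketch: tubes are represented as dictionaries mapping a frame index to an `(x1, y1, x2, y2)` box, and an annotated frame missing from the predicted tube is counted as zero overlap; the function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tube_overlap(pred, gt):
    """Average IoU over the annotated frames of the ground-truth tube."""
    ious = [box_iou(pred[t], gt[t]) if t in pred else 0.0 for t in gt]
    return sum(ious) / len(ious)


def accuracy(overlaps, eta):
    """Fraction of test samples whose tube overlap exceeds the threshold eta."""
    return sum(o > eta for o in overlaps) / len(overlaps)
```

Evaluating `accuracy` at η = 0.4, 0.5, and 0.6 on the same list of overlaps gives the three columns reported per method.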

Ablation Study
To verify the contributions of the proposed attentive interactor and diversity loss, we perform the following ablation study. To be specific, we compare the full method with three variants: 1) Base, which removes both the attentive interactor and the diversity loss and is equivalent to the DVSA model using LSTM for encoding both the visual features and sentence features; 2) Base+Div, which is formed by introducing the diversity loss; 3) Base+Int, which adds the attentive interactor module. Table 3 shows the corresponding results. Compared with Base, both the diversity loss and the attentive interactor consistently improve the performance. Moreover, to show the effectiveness of the proposed attentive interactor, we visualize the adaptive weight $a$ in Eq. (5). As shown in Fig. 6, our method adaptively pays more attention to the words that match the instance, such as "puppy" in all three segments and "hand" in the segment with ID 2. To show the effectiveness of the diversity loss, we divide instance proposals in the testing set into 10 groups based on their IoU scores with the ground-truth and then calculate the average matching behaviors of each group, predicted by the counterparts with and without the diversity loss. As shown in Fig. 7, the proposed diversity loss $L_{div}$ penalizes the matching behaviors of the instances with lower IoU with the ground-truth while strengthening those of the instances with higher IoU.

Figure 6: Intuitively, the attention weight matches well with the visual contents such as "puppy" in all three segments and "hand" in the segment with ID 2. Best viewed on screen.
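The grouping analysis for the diversity loss can be sketched as follows. The text does not say how the 10 groups are formed, so equal-width IoU bins on [0, 1] are an assumption here, and the function name is hypothetical.

```python
import numpy as np


def mean_score_per_iou_bin(ious, scores, num_bins=10):
    """Average matching score of the proposals falling into each equal-width IoU bin.

    Bins without any proposal are reported as NaN.
    """
    ious = np.asarray(ious, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # IoU of exactly 1.0 is placed in the last bin
    bins = np.minimum((ious * num_bins).astype(int), num_bins - 1)
    means = np.full(num_bins, np.nan)
    for b in range(num_bins):
        mask = bins == b
        if mask.any():
            means[b] = scores[mask].mean()
    return means
```

Running this once with scores from a model trained with the diversity loss and once without it gives two curves over the IoU bins that can be compared directly.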

Experiments on Person-sentence Dataset
We further evaluate our model and the baseline methods on the Person-sentence dataset (Yamaguchi et al., 2017). We ignore the bounding box annotations in the training set and carry out experiments for the proposed WSSTG task. For fair comparisons, all experiments use the visual feature extractor provided by (Carreira and Zisserman, 2017). Table 4 shows the results. Similarly, the proposed attentive interactor model (without the diversity loss) outperforms all the baselines. Moreover, the diversity loss further improves the performance. Note that the improvement of our model on this dataset is more significant than that on the VID-sentence dataset. The reason might be that the upper bound performance on Person-sentence is much higher than that on VID-sentence (77.9 for Person-sentence versus 47.6 for VID-sentence on average). This also suggests that the created VID-sentence dataset is more challenging and more suitable as a benchmark dataset.

Conclusion
In this paper, we introduced a new task, namely weakly-supervised spatio-temporally grounding natural sentence in video. It takes a sentence and a video as input and outputs a spatio-temporal tube from the video, which semantically matches the sentence, with no reliance on spatio-temporal annotations during training. We handled this task based on the multiple instance learning framework. An attentive interactor and a diversity loss were proposed to learn the complicated relationships between the instance proposals and the sentence. Extensive experiments showed the effectiveness of our model. Moreover, we contributed a new dataset, named VID-sentence, which can serve as a benchmark for the proposed task.