Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network

Temporal sentence localization in videos aims to ground the best matched segment in an untrimmed video according to a given sentence query. Previous works in this field mainly rely on attentional frameworks to align the temporal boundaries by a soft selection. Although they focus on the visual content relevant to the query, such single-step attention is insufficient to model complex video contents and cannot meet the higher-level reasoning demands of this task. In this paper, we propose a novel deep rectification-modulation network (RMN), transforming this task into a multi-step reasoning process by repeating rectification and modulation. In each rectification-modulation layer, unlike existing methods that directly conduct the cross-modal interaction, we first devise a rectification module to correct implicit attention misalignment, i.e., attention that focuses on the wrong position during the cross-interaction process. Then, a modulation module is developed to capture the frame-to-frame relation with the help of sentence information for better correlating and composing the video contents over time. With multiple such layers cascaded in depth, our RMN progressively refines video and query interactions, thus enabling more precise localization. Experimental evaluations on three public datasets show that the proposed method achieves state-of-the-art performance. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method.


Introduction
Localizing activities in videos (Regneri et al., 2013; Yuan et al., 2016; Gavrilyuk et al., 2018; Feng et al., 2018; Feng et al., 2019) is an important topic in information retrieval systems. As most videos contain activities of interest with complicated background contents, these activities cannot be directly described by a pre-defined list of action classes. Recently, a new task called temporal sentence localization in videos (Gao et al., 2017; Anne Hendricks et al., 2017) has been proposed to tackle this problem, attracting great interest from both the vision and language communities (Liu et al., 2020). Given an untrimmed video, this task aims to infer the start and end timestamps of a target video segment that contains the activity of interest according to a given sentence query. Traditional methods (Gao et al., 2017; Ge et al., 2019; Chen and Jiang, 2019; Anne Hendricks et al., 2017) are based on sliding windows: they first sample candidate video segments and then compare the sentence with each video segment separately to calculate the matching relationships. These methods cannot achieve precise alignment between video and sentence, thus leading to inaccurate temporal boundaries. Recently, some works (Zhang et al., 2019b; Zhang et al., 2019a) try to avoid this problem by designing end-to-end models. They first integrate the features of the whole video with sentence information and then utilize LSTM or CNN layers to compose such integrated video features for further segment localization. Although these methods achieve promising results, some problems remain to be addressed.
First, previous works typically adopt single-step attention for multi-modal feature interaction, which limits the modeling power for two reasons: 1) it cannot mine sufficient relationships between the modalities; 2) once the cross-modal relation focuses on the wrong position without further calibration, it can heavily jeopardize the localization performance. For example, as shown in Figure 1 (left), consider the activity "the girl in the blue dress hops for a second time". All frames share a similar visual appearance of "girl" and "blue dress", and single-step reasoning may guide the model to focus on the action word "hops". It is hard to lead the model to directly pay more attention to the adjective "second". Therefore, a multi-step reasoning framework needs to be developed, not only for rectifying the attention errors from the previous reasoning step, but also for helping the model gradually focus on the most matched words or frames.
Second, previous works mainly focus on aligning the sentence information with video clips. Although it is crucial to capture such cross-modal relations between the two modalities for highlighting the matched words or frames, the self-relation among video frames is also important for correlating and composing the sentence-related video contents over time. To effectively model temporal activities, such self-relations need to be modulated by the information from the other modality; namely, the relations between visual frames should be weighted differently according to different sentence queries. As shown in Figure 1 (right), where the video contains multiple segment-sentence annotation pairs, the third frame should be correlated with the second one when querying sentence S2, but this correlation should not be established when given the sentence S1. Therefore, how to modulate the temporal relation among video frames conditioned on the matched words from the whole sentence is vital for this task.
Figure 1: Illustration of the rectification and modulation modules for precise localization. Left: The rectification module helps correct the attention to focus on the best matched position. Right: The modulation module correlates the video contents with different weights referring to different sentence semantics.
In this paper, we propose a novel rectification-modulation network (RMN), which modulates conditioned temporal relations with multiple reasoning steps for temporal sentence localization in videos. In the rectification module, to avoid accumulating errors from a wrong relation learned in the previous reasoning step, we adopt the initial modal features as a global information flow to correct the attention errors. In the modulation module, we modulate the temporal relation among frames according to the sentence semantics for better correlating sentence-related video contents over time. With multiple such rectification-modulation layers cascaded in depth, our model can reason about higher-order multi-modal interactions step-by-step, providing more accurate video segment boundaries.
In summary, this paper makes the following contributions:
• We propose a novel rectification-modulation network (RMN), which adopts a multi-step reasoning framework to gradually capture higher-order multi-modal interactions.
• The rectification module utilizes the initial multi-modal features as global information to help our model rectify attention that focused on the wrong position in the previous reasoning step.
• The modulation module considers the self-modal relation between video frames conditioned on the sentence semantics. In this way, each frame can be associated with the most matched words for correlating the video contents of interest.
• We conduct experiments on three public datasets and verify the effectiveness of our proposed RMN, which outperforms the state-of-the-art methods.

Related Work
Temporal sentence localization in videos is a new task introduced recently (Gao et al., 2017; Anne Hendricks et al., 2017), which aims to localize the most relevant video segment from a video given a text description. Traditional methods (Gao et al., 2017) adopt a two-stage multi-modal matching strategy, which first samples candidate segments from a video and subsequently integrates the query with the segment representations via a matrix operation. However, these methods lack a comprehensive structure for effective multi-modal feature interaction. Based on such a multi-modal matching framework, some works (Xu et al., 2019; Chen and Jiang, 2019; Ge et al., 2019) integrate the sentence representation with the video segments individually, and then evaluate their matching relationships through the integrated features. For instance, Xu (Xu et al., 2019) introduce a multi-level model to integrate visual and textual features and further re-generate queries as an auxiliary task. Ge (Ge et al., 2019) and Chen capture the evolving fine-grained frame-by-word interactions between video and query to enhance the video representation understanding.
Figure 2: An overview of our proposed rectification-modulation network (RMN). We first embed both visual and language representations by the multi-modal encoders. Then, multi-step rectification-modulation layers are developed to correlate and compose the video contents referring to the sentence-related information. At last, we integrate the multi-modal features for moment localization.
Recently, other works (Wang et al., 2020; Zhang et al., 2019b; Zhang et al., 2019a; Mithun et al., 2019) propose to directly integrate sentence information with each fine-grained video clip unit, and predict the temporal boundary of the target segment by gradually merging the fused feature sequence over time. Wang (Wang et al., 2020) aggregate contextual information by explicitly modeling the relationship between the current element and its neighbors. Zhang (Zhang et al., 2019a) model relations among candidate segments with the guidance of the query information. To modulate temporal convolution operations, Yuan and Mithun (Mithun et al., 2019) introduce the sentence information as a critical prior to compose and correlate video contents.
Although existing methods perform well in this task, all of them adopt a single-step model and only consider aligning the sentence information with video clips, neglecting to associate the video frames conditioned on the sentence features for more precise moment localization. In this paper, we develop a modulation module to modulate the conditioned temporal relation for correlating and composing the video contents. Moreover, we repeat the rectification-modulation layer multiple times for deeper reasoning.

The Proposed RMN Model
Given an untrimmed video $V$ and a sentence query $Q$, the task aims to determine the start and end timestamps $(s, e)$ of a specific video segment, which corresponds to the activity of the given sentence query. Formally, we represent the video as $V = \{v_t\}_{t=1}^{T}$ frame-by-frame, and denote the given sentence query as $Q = \{q_n\}_{n=1}^{N}$ word-by-word. With the training set $\{V, Q, (s, e)\}$, in this paper, we propose a deep rectification-modulation network (RMN) to learn to predict the most relevant video segment boundary $(\hat{s}, \hat{e})$. As shown in Figure 2, our method contains four parts: multi-modal encoding, multi-step rectification-modulation layers, multi-modal integration, and moment localization.

Video and Query Encoding
For video encoding, we first extract the frame-wise features by a pre-trained C3D network (Tran et al., 2015), and then employ a self-attention (Vaswani et al., 2017) module to learn the semantic dependencies in the long video context. Considering the sequential characteristic of video, a bi-directional GRU (Chung et al., 2014) is further utilized to incorporate the contextual information. For query encoding, we first extract the word embeddings by GloVe (Pennington et al., 2014), and then feed them into another bi-directional GRU to integrate the sequential information. We denote the embeddings of video and query as $V = \{v_t\}_{t=1}^{T} \in \mathbb{R}^{T \times d}$ and $Q = \{q_n\}_{n=1}^{N} \in \mathbb{R}^{N \times d}$, respectively.
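As a rough illustration of the self-attention step over frame features, the sketch below applies scaled dot-product attention in the style of Vaswani et al. (2017) to a toy sequence. It is a minimal sketch rather than the encoder used in this paper: the learned query/key/value projections are replaced by identity maps, and the function name `self_attention` and the toy inputs are assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Scaled dot-product self-attention over a list of frame vectors.

    Assumption: identity projections stand in for the learned
    query/key/value linear layers of a real encoder.
    """
    d = len(frames[0])
    attended = []
    for q in frames:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        # Each output is a convex combination of all frame vectors.
        attended.append(
            [sum(w * v[c] for w, v in zip(weights, frames)) for c in range(d)]
        )
    return attended

# Toy "video" with T = 3 frames and d = 2 channels (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(frames)
```

Each attended frame is a weighted average of all frames, so long-range dependencies can influence every position before the bi-directional GRU adds sequential context.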

Multi-Step Rectification-Modulation Layers
In the task of temporal sentence localization in videos, besides understanding the video clip contents, capturing their temporal correlations plays an even more important role. Fortunately, the query sentence presents rich semantic indications of such important correlations, providing crucial information to temporally associate and compose the consecutive video contents over time. Based on the above considerations, we propose a modulation module, which modulates the temporal frame-to-frame relation conditioned on the sentence semantic information for better composing the video contents. To avoid interactions that focus on the wrong position, we additionally develop a rectification module to correct the attention error from the previous reasoning step. We conduct rectification and modulation on both video and query to enhance the information flows. With multiple such layers cascaded in depth, our model can reason about higher-order multi-modal interactions for more precise video segment localization.
Notations. At the $l$-th rectification-modulation layer, we define the multi-modal representation inputs and outputs as $\hat{V}^{l-1}, \hat{Q}^{l-1}$ and $\hat{V}^{l}, \hat{Q}^{l}$, respectively. We also denote the multi-modal hidden states from the previous reasoning layer as $H_V^{l-1}, H_Q^{l-1}$, which are utilized as constraints for cross-modal interaction and conditions for self-modal interaction. Specifically, we initialize $\hat{V}^{0}, \hat{Q}^{0} = V + \mathrm{PE}(V),\ Q + \mathrm{PE}(Q)$ with positional encoding (Vaswani et al., 2017), which adds positional knowledge to enhance the semantic information, and we set the initial hidden states as $H_V^{0}, H_Q^{0} = \hat{V}^{0}, \hat{Q}^{0}$, respectively.
Rectification Module. Given the multi-modal representations and hidden states from the previous layer, we first aim to rectify the attention error that arises if the relation learned in the previous reasoning step is focused on the wrong position. Specifically, we utilize the initial modal features $V, Q$ as global information to regularize and re-correct the multi-modal flow $\hat{V}^{l-1}, \hat{Q}^{l-1}$ by an update gate, where $\sigma$ is the sigmoid function, $W_Z, W_z, W_v, W_q$ are the parameters of linear layers, and $\odot$ denotes element-wise multiplication. With such rectified representations of the two modalities, we further utilize the cross information flow from the other modality to enhance the representation of each modality. Instead of directly computing the cross-relation between the representations $(\tilde{V}^{l-1}, \tilde{Q}^{l-1})$, we consider more detailed latent clues from the hidden states $(H_Q^{l-1}, H_V^{l-1})$ of the previous reasoning step, which can provide more discriminative information for each modality. Following the co-attention mechanism (Lu et al., 2016), we calculate the correlation matrices $M_V^{l}, M_Q^{l}$ of cross-modal instances, where $W_V, W_Q, W_H, W_h$ are the learnable parameters. Each row of $M_V^{l}$ denotes the similarity of all word features to a specific frame feature, and each row of $M_Q^{l}$ represents the similarity of all frame features to a specific word feature.
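The two ideas of the rectification module, gating the previous layer's output against the initial features and then aggregating information from the other modality through a row-normalised similarity matrix, can be sketched as follows. This is a sketch under stated assumptions, not the paper's exact formulation: the gate is written as a GRU-style convex blend with per-channel weights, the correlation uses plain dot products instead of the learned projections and hidden states, and the names `update_gate`, `cross_modal_aggregate`, `w_init`, and `w_prev` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def update_gate(initial, previous, w_init, w_prev):
    """Blend initial features with the previous layer's output.

    Assumed form: z = sigmoid(w_init * x0 + w_prev * x_prev) per channel,
    output = z * x0 + (1 - z) * x_prev.
    """
    rectified = []
    for x0, xp in zip(initial, previous):
        row = []
        for c in range(len(x0)):
            z = sigmoid(w_init[c] * x0[c] + w_prev[c] * xp[c])
            row.append(z * x0[c] + (1.0 - z) * xp[c])
        rectified.append(row)
    return rectified

def cross_modal_aggregate(frames, words):
    """For each frame, softmax its similarity to every word (one row of the
    correlation matrix) and return the weighted sum of word features."""
    aggregated = []
    for f in frames:
        sims = [sum(a * b for a, b in zip(f, w)) for w in words]
        weights = softmax(sims)
        aggregated.append(
            [sum(wt * w[c] for wt, w in zip(weights, words)) for c in range(len(f))]
        )
    return aggregated

# Toy inputs (hypothetical): 2 frames, 2 words, 2 channels.
V0 = [[1.0, 0.0], [0.0, 1.0]]       # initial video features
V_prev = [[0.0, 0.0], [1.0, 1.0]]   # output of the previous layer
rectified = update_gate(V0, V_prev, w_init=[0.0, 0.0], w_prev=[0.0, 0.0])
words = [[1.0, 0.0], [0.0, 1.0]]
info = cross_modal_aggregate(rectified, words)
# Enhanced rectified features: simple addition of the two information flows.
enhanced = [[r + i for r, i in zip(rr, ii)] for rr, ii in zip(rectified, info)]
```

With zero gate weights the gate value is 0.5 everywhere, so the rectified features are the midpoint between the initial and previous-layer features; learned weights would let the model decide, per channel, how strongly to pull the representation back towards the initial features.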
The value of each similarity will be high if the word-frame pair is relevant, and low otherwise. To aggregate the cross-modal information $I_V^{l}, I_Q^{l}$ for $\tilde{V}^{l-1}, \tilde{Q}^{l-1}$, we utilize a weighted summation strategy based on the correlation matrices $M_V^{l}, M_Q^{l}$. We then obtain the enhanced rectified video features $S_V^{l}$ and the enhanced rectified sentence features $S_Q^{l}$ by a simple addition of the two information flows, as in (Fukui et al., 2016).
Modulation Module. After obtaining the enhanced rectified video and sentence features, it is also important to capture the temporal correlations within each modality. As in the cross-modal attention mechanism, we can directly calculate the frame-to-frame or word-to-word relations and compute the normalized weights $A$ for each instance in each modality, where $W_S, W_s$ are the parameters of linear layers. Although such a naive self-attention matrix $A$ estimates the frame-to-frame and word-to-word importance, relations that can only be identified conditioned on information from the other modality cannot be captured. For example, if the video contains multiple moment-sentence annotation pairs, the relations between different visual frames should be weighted differently according to the given sentence query. As the given sentence query presents rich semantic indications of such important correlations for better correlating and composing the consecutive video contents over time, we modulate the temporal frame-to-frame relations referring to the sentence semantics to improve the self-relation matrix $A$ in Eq. (6), where $(\cdot \otimes e_T)$ is the outer product that produces a matrix by repeating the vector on the left $T$ times, and $W_{\{1,2,3,4,5,6\}}$ are the parameters of linear layers. By expanding the sentence/video features after mean pooling, the conditional information $C$ from the other modality can be acquired.
Channels of the feature $S$ are further activated or deactivated by the channel-wise gates derived from the condition $C$, which is similar in spirit to the Squeeze-and-Excitation Network (Hu et al., 2018) and Gated Convolution (Gehring et al., 2017). Therefore, each temporal feature map can absorb the sentence semantic information, and further activate the self-correlation matrix for better associating and composing the sentence-related video contents. Words can also enhance their contextual meaning in the same way. We apply matrix multiplication on the self-relation weights and the multi-modal features to generate self-interacted information, which we denote as the hidden states $H_V^{l}, H_Q^{l}$ and pass as input to the next reasoning layer. We concatenate the self-interacted information with the enhanced rectified features as the final output of the current rectification-modulation layer.
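To make the conditioning idea concrete, the following sketch shows one possible way a mean-pooled sentence vector could gate the channels of the frame features before the frame-to-frame attention is computed, followed by the concatenation that forms the layer output. It only illustrates the video side and is not the paper's exact equations: the learned linear layers are omitted, the gate is a plain sigmoid of the pooled sentence vector, and the function name `modulation` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def modulation(frames, words):
    """Sentence-conditioned self-attention over frames (assumed form)."""
    dim = len(frames[0])
    # Condition C: mean-pool the word features into one sentence vector.
    cond = [sum(w[c] for w in words) / len(words) for c in range(dim)]
    # Channel-wise gate, shared by every frame (squeeze-and-excitation style).
    gate = [sigmoid(c) for c in cond]
    gated = [[g * x for g, x in zip(gate, f)] for f in frames]
    hidden = []
    for q in gated:
        # Frame-to-frame weights are computed on the gated features ...
        weights = softmax([dot(q, k) for k in gated])
        # ... and applied to the (ungated) frame features.
        hidden.append(
            [sum(w * f[c] for w, f in zip(weights, frames)) for c in range(dim)]
        )
    # Layer output: concatenate input features with self-interacted ones.
    output = [f + h for f, h in zip(frames, hidden)]
    return hidden, output

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Two hypothetical queries yield different frame-to-frame relations.
hidden_a, out_a = modulation(frames, [[0.0, 0.0]])
hidden_b, out_b = modulation(frames, [[10.0, -10.0]])
```

The same frames produce different hidden states under the two toy queries, which is the behaviour the modulation module is designed for: the relation between two frames depends on which sentence is being grounded.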

Multi-modal Integration
After multiple rectification-modulation layers, we apply two linear layers to the outputs of the two modalities to obtain the final video and sentence representations $\hat{V}$ and $\hat{Q}$. We additionally utilize a cosine similarity function (Mithun et al., 2019) to transform $\hat{Q}$ to the same dimension as $\hat{V}$. To further emphasize crucial contents and weaken inessential parts in each modality, we design a gate function, where $W_G, W_g$ and $b_G, b_g$ are learnable parameters. We then integrate the gated multi-modal features into the representation $f$.

Moment Localization
With the integrated representation $f$, we further apply a bi-directional GRU network to absorb more contextual evidence in the temporal domain. To predict the target video segment, we first pre-define a set of candidate moments $\Phi_t = \{(\hat{s}_{t,i}, \hat{e}_{t,i})\}_{i=1}^{N_\Phi}$ with multi-scale windows at each time $t$, where $N_\Phi$ is the number of moments at the current time step. Then, we adopt a Conv1d layer to score these candidate moments and predict their offsets $\hat{\delta}_t = \{(\hat{\delta}^{s}_{t,i}, \hat{\delta}^{e}_{t,i})\}_{i=1}^{N_\Phi}$ relative to the ground truth. The confidence scores $cs_t = \{cs_{t,i}\}_{i=1}^{N_\Phi}$ for these moments are obtained by applying the sigmoid function $\sigma(\cdot)$ to the Conv1d output, which normalizes the scores. The temporal offsets of each candidate moment $i$ at time $t$ are predicted by another Conv1d layer. Therefore, the final predicted moment $i$ at time $t$ can be presented as $(\hat{s}_{t,i} + \hat{\delta}^{s}_{t,i},\ \hat{e}_{t,i} + \hat{\delta}^{e}_{t,i})$.
Training. To learn the confidence scoring rule for candidate moments, we compute the IoU (Intersection over Union) score $IoU_{t,i}$ between each candidate moment $(\hat{s}_{t,i}, \hat{e}_{t,i})$ and the ground truth $(s_t, e_t)$, and adopt an alignment loss function to train the scoring rule. Since some of the pre-defined candidates have coarse boundaries, we learn offset prediction by fine-tuning the localization offsets of positive moment samples only. We treat a candidate moment as a positive sample if its $IoU_{t,i}$ is larger than an IoU threshold $\tau$. In the moment boundary loss for offset prediction, $N_{pos}$ denotes the number of positive moments, and $R_1$ is the smooth L1 loss. The two losses are jointly considered for training, balanced by the hyper-parameter $\alpha$.
Inference. We first rank all candidate moments according to their predicted confidence scores, and then adopt non-maximum suppression (NMS) to select the "Top n" moments as the prediction.
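The pieces of the localization head that do not depend on learned weights (temporal IoU, offset refinement, the smooth L1 function, and NMS-based selection) can be sketched in a few lines. The alignment loss is written here as a binary cross-entropy with the IoU as a soft target, which is an assumption about its form rather than something stated in this text; the NMS threshold and all function names are likewise hypothetical.

```python
import math

def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def refine(moment, offsets):
    """Apply predicted (start, end) offsets to a candidate moment."""
    return (moment[0] + offsets[0], moment[1] + offsets[1])

def alignment_loss(ious, scores):
    """Assumed form: cross-entropy between IoU targets and confidences."""
    eps = 1e-8
    total = 0.0
    for iou, cs in zip(ious, scores):
        total += iou * math.log(cs + eps) + (1.0 - iou) * math.log(1.0 - cs + eps)
    return -total / len(ious)

def smooth_l1(x):
    """Smooth L1 (Huber-style) function applied to an offset error."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def nms(moments, scores, iou_threshold, top_n):
    """Greedy non-maximum suppression; returns up to top_n moments."""
    order = sorted(range(len(moments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(temporal_iou(moments[i], moments[j]) <= iou_threshold for j in kept):
            kept.append(i)
        if len(kept) == top_n:
            break
    return [moments[i] for i in kept]

# Toy candidates and confidence scores (hypothetical values).
moments = [(0.0, 10.0), (1.0, 11.0), (20.0, 30.0)]
scores = [0.9, 0.8, 0.7]
selected = nms(moments, scores, iou_threshold=0.5, top_n=2)
refined = refine(moments[0], (0.5, -1.0))
```

In the toy example the second candidate overlaps the first with an IoU of 9/11, so it is suppressed and the third candidate is kept instead; this is why NMS yields diverse "Top n" predictions rather than near-duplicates of the highest-scoring window.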

Datasets and Evaluation
Activity Caption (Krishna et al., 2017): It contains 20k untrimmed videos with 100k descriptions of complicated human activities in daily life. Since the test split is withheld for competition, we follow the public split and use 37,417, 17,505, and 17,031 query-segment pairs for training, validation, and testing, respectively.
TACoS (Regneri et al., 2013): This dataset is collected from cooking scenarios and contains 127 videos. We use the same split as (Gao et al., 2017), which includes 10,146, 4,589, and 4,083 query-segment pairs for training, validation, and testing, respectively.
Evaluation Metrics. Following previous works (Gao et al., 2017), we adopt "R@n, IoU=m" as our evaluation metric. "R@n, IoU=m" is defined as the percentage of queries for which at least one of the top-n selected moments has an IoU larger than m with the ground truth.
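The "R@n, IoU=m" metric can be computed directly from its definition. The sketch below assumes that each query comes with a ranked list of predicted (start, end) moments and one ground-truth segment; the function names and the toy data are illustrative only.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def recall_at_n(predictions, ground_truths, n, m):
    """Percentage of queries whose top-n predictions contain at least
    one moment with IoU larger than m against the ground truth."""
    hits = 0
    for ranked, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) > m for p in ranked[:n]):
            hits += 1
    return 100.0 * hits / len(ground_truths)

# Two hypothetical queries, each with two ranked predictions.
predictions = [
    [(0.0, 10.0), (5.0, 15.0)],
    [(0.0, 5.0), (20.0, 30.0)],
]
ground_truths = [(4.0, 14.0), (21.0, 29.0)]

r1_iou05 = recall_at_n(predictions, ground_truths, n=1, m=0.5)
r2_iou05 = recall_at_n(predictions, ground_truths, n=2, m=0.5)
r1_iou03 = recall_at_n(predictions, ground_truths, n=1, m=0.3)
```

In this toy case neither top-1 prediction exceeds IoU 0.5, while both queries are hit once the second-ranked prediction is allowed, which illustrates why R@5 numbers are consistently higher than R@1 numbers at the same IoU threshold.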

Implementation Details
We use frames of 112 × 112 pixels as input, and apply C3D (Tran et al., 2015) for Activity Caption and TACoS, and I3D (Carreira and Zisserman, 2017) for Charades-STA, to encode the videos. We set the length of the video feature sequences to 200 for Activity Caption and TACoS, and to 64 for Charades-STA. For sentence encoding, we utilize GloVe (Pennington et al., 2014) to embed each word into a 300-dimensional feature. The hidden state dimension of the BiGRU networks is set to 512. During moment localization, we adopt convolution kernel sizes of [16, 32, 64, 96, 128, 160, 192] for Activity Caption, [8, 16, 32, 64] for TACoS, and [16, 24, 32, 40] for Charades-STA, with strides of 0.5, 0.125, and 0.125, respectively. We set the high-score threshold $\tau$ to 0.45, and the balance hyper-parameter $\alpha$ to 0.001 for Activity Caption and 0.005 for TACoS and Charades-STA. We adopt 5 rectification-modulation layers for all datasets. We train our model with an Adam optimizer with learning rates of $8 \times 10^{-4}$, $3 \times 10^{-4}$, and $4 \times 10^{-4}$ for Activity Caption, TACoS, and Charades-STA, respectively.
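For reference, the per-dataset settings listed above can be collected into a single configuration structure. The values are taken from this section; the class name `RMNConfig`, the field names, and the dictionary keys are hypothetical and are not part of any released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RMNConfig:
    """Hyper-parameters as reported in this section (field names assumed)."""
    video_encoder: str        # backbone used to extract frame features
    num_clips: int            # length of the video feature sequence
    window_sizes: tuple       # convolution kernel sizes for candidates
    stride: float             # stride reported for the candidate windows
    alpha: float              # balance between the two losses
    learning_rate: float      # Adam learning rate
    frame_size: int = 112     # input frames are 112 x 112 pixels
    word_dim: int = 300       # GloVe embedding dimension
    hidden_dim: int = 512     # BiGRU hidden state dimension
    iou_threshold: float = 0.45   # threshold tau for positive samples
    num_layers: int = 5       # rectification-modulation layers

CONFIGS = {
    "Activity Caption": RMNConfig(
        "C3D", 200, (16, 32, 64, 96, 128, 160, 192), 0.5, 0.001, 8e-4
    ),
    "TACoS": RMNConfig("C3D", 200, (8, 16, 32, 64), 0.125, 0.005, 3e-4),
    "Charades-STA": RMNConfig("I3D", 64, (16, 24, 32, 40), 0.125, 0.005, 4e-4),
}

tacos = CONFIGS["TACoS"]
```

Keeping the shared defaults (threshold, layer count, hidden size) on the class and only the dataset-specific values in the dictionary makes it easy to see which settings were tuned per dataset.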

Compared Methods
We compare our proposed model with the state-of-the-art baseline methods, which can be divided into two classes: 1) Sliding window based methods:

Performance Comparison
The performance comparisons with existing state-of-the-art methods on three datasets are shown in Table 1 and Table 2. We can observe that our RMN achieves new state-of-the-art performance under all evaluation metrics and benchmarks, demonstrating the superiority of our proposed model. For localizing complex human activities in the Activity Caption and Charades-STA datasets, our model surpasses others by a clear margin on both R@1 and R@5 metrics. Specifically, our method brings 3.33% and 3.55% absolute improvements on the strict metric "R@1, IoU=0.7", and 6.03% and 2.94% absolute improvements on the strict metric "R@5, IoU=0.7", on the two datasets, respectively. For TACoS, where the cooking activities take place in the same kitchen scene with only slightly varied cooking objects, it is hard to localize such fine-grained activities. However, our model still achieves the best performance on both R@1 and R@5 metrics by a clear margin.
The reasons why our proposed RMN outperforms the state-of-the-art methods are twofold. First, instead of only capturing the cross-modal relations (e.g., SCDM, GDP), we additionally modulate the temporal relations among frames referring to sentence-related semantic information. Such a modulation module helps the model better correlate and compose the most relevant video contents according to the sentence over time. Second, compared to single-step interaction methods (e.g., CMIN, TGN), our multi-step reasoning process can gradually focus on the most relevant frames and words for better interaction. In addition, the rectification module is able to correct the attention error from the previous reasoning step.
Table 4: Ablation studies of the rectification module and modulation module on the TACoS dataset.

Ablation Study
How does the rectification-modulation interaction layer help? The proposed rectification-modulation interaction layer is the key for our method to reason about higher-level interactions between the two modalities. As shown in Table 3, we set the number of such interaction layers to 1 as our baseline model. Here, we first conduct an ablation study on the interaction layer with three model variants: w/o R&M (using neither rectification nor modulation), w/o REC (without rectification only), and w/o MOD (without modulation only). We find that w/o R&M achieves the worst performance, as it lacks efficient interaction. Both w/o REC and w/o MOD achieve relatively higher results, but these are still lower than the result of the default setting, which indicates that both rectification and modulation are crucial for this task. Moreover, we also investigate the influence of the number of stacked interaction layers. As shown in Table 3, we find that more layers can improve the performance thanks to our rectification module, and our model achieves the best result with 5 layers.
How does the rectification module help? The rectification module integrates the previous reasoning output with a global information flow from the initial modal features. We conduct an ablation study on the usage of such a global flow with different settings: we denote the addition operation illustrated in Eq. (1) as w/ ADD; we remove the global flow as w/o ADD; we replace the addition operation with concatenation (w/ CONC); and we utilize all previous layer features (Nam et al., 2017), including the initial features, as the global flow (w/ MEM). As shown in Table 4, w/o ADD achieves the worst performance, as it lacks attention rectification. The w/ ADD setting outperforms the other two variants, which indicates that the initial modal features are more effective than the features of all previous layers for rectifying the attention error.
How does the modulation module help? To evaluate the contribution of our conditional modulation module, we conduct an ablation study on different conditioning methods. w/ FMUL: our proposed channel-wise conditioning method in Eq. (8); w/o FMUL: we capture the self-relation without conditional information, as in Eq. (6); w/ MUL: we replace FMUL with a direct element-wise multiplication of C and S; w/ FC: we replace FMUL with an FC layer that fuses each temporal feature unit with each sentence representation; w/ CROG: instead of using FMUL, we utilize a cross-gate (Feng et al., 2018) as the condition. As shown in Table 4, w/ FMUL performs the best.

Figure 3: Frame-to-word and frame-to-frame relations at Step 1, Step 3, and Step 5. Annotations: rectify the focus from "tapes" to "white"; strengthen the focus on the word "white".

Qualitative Results
To investigate how our rectification and modulation modules work step-by-step, we show one visualization example from the Activity Caption dataset in Figure 3. As shown in the left part, we first visualize the frame-to-word relation learned by the rectification module in different reasoning steps. At the first step, the 2nd to 5th frames have similar word-related attention, which focuses on the same words "tapes" and "tape". Although the 2nd frame has a visual appearance similar to the 3rd to 5th frames, the person in it applies red tape, not the mentioned "white" tape. As the steps go on, the rectification module adjusts the attention of the previous step from "tape" to "white", which distinguishes the 2nd frame from the 3rd to 5th frames. At step 5, the frame-to-word relations are more distinguishable and the attention on the target frames is focused more on the word "white". This demonstrates that our rectification module helps the model rectify the attention weights for better grounding of the segment boundaries. In the right part, we visualize the attention weights of the frame-to-frame relation after the softmax function. Similar to the frame-to-word relation, in the first step, the 2nd frame acts as a noisy frame that disturbs the frame-wise correlation. Thanks to the rectification module, as the reasoning steps go on, the weight of the noisy frame becomes smaller and our modulation module can better capture the temporal relation referring to the matched words. To qualitatively validate the effectiveness of our method, we also show some qualitative examples from three datasets in Figure 4, where our model provides more precise video segment boundaries.

Conclusion
In this paper, we propose a deep multi-step rectification-modulation network (RMN) for temporal sentence localization in videos. Different from previous single-step methods, we utilize the initial multi-modal features as global information flows to correct the attention errors from the previous reasoning step in the rectification module. In the modulation module, we modulate the temporal relation among video frames referring to sentence semantics for better associating and composing sentence-related video contents over time. With multiple such rectification-modulation layers cascaded in depth, our model can reason about the matched video segment step-by-step according to the selected words from the given sentence query. Extensive experiments on three real-world datasets validate the effectiveness of our method.