Guiding the Flowing of Semantics: Interpretable Video Captioning via POS Tag

In current video captioning models, the video frames are collected in one network and the semantics are mixed into one feature, which not only increases the difficulty of caption decoding but also decreases the interpretability of the captioning models. To address these problems, we propose an Adaptive Semantic Guidance Network (ASGN), which instantiates the whole video semantics into different POS-aware semantics under the supervision of part-of-speech (POS) tags. In the encoding process, the POS tag activates the related neurons and parses the whole semantic information into corresponding encoded video representations. Furthermore, the potential of the model is stimulated by the POS-aware video features. In the decoding process, the video features related to nouns and verbs are used as the supervision to construct a new adaptive attention model which can decide whether to attend to the video feature or not. By explicitly improving the interpretability of the network, the learning process becomes more transparent and the results more predictable. Extensive experiments demonstrate the effectiveness of our model when compared with state-of-the-art models.


Introduction
Video captioning, which transforms the semantic information in a video into a natural-language sentence, has received wide attention recently. The series of scenes (both related and unrelated) in video frames poses a major challenge for the task. Therefore, the ability to process correlated and irrelevant semantic information can improve both the performance and the interpretability of a video captioning model.
The classical deep learning based video captioning methods (Venugopalan et al., 2015a,b) adopt an encoder-decoder architecture which extracts all the semantic information in one stream and fuses it into a single feature. In this architecture, the transformation from semantics to word predictions is uncertain. Some other recent studies (Wu et al., 2018) attempt to improve the interpretability of the transformation process, but the flow of different semantic information through the network streams and the neurons activated in the network remain ambiguous. To address these issues, the first step is to take the discrimination of different semantic information into consideration. According to (Horoufchin et al., 2018; Fargier and Laganaro, 2015), nouns and verbs typically describe concrete objects and actions, which can recruit the canonical neuron system to activate corresponding representation patterns in the brain. This indicates that different neurons in the brain are activated for the lexical selection of action and object words. Therefore, the part-of-speech (POS) tag can be applied to guide the flow of different semantics into the corresponding network streams.
A POS tag is a category of words that have similar grammatical properties. The words assigned to the same POS tag generally reflect similar properties within the grammatical structure of a sentence. Fig. 1 (a) shows a video and its caption "a man is lifting weights". The nouns "man" and "weights" and the verb "lifting" belong to different POS tags and refer to different semantic information in the video. Moreover, in Fig. 1 (b), the caption of the video is "a girl is talking on the phone to another girl". The words "girl", "talking" and "phone" clearly have corresponding visual signals in the video, but the others are uncertain. This example indicates that nouns denoting objects and verbs denoting actions generally correspond to visual words. Another property of the POS tag is that, for a fixed sentence, the POS tags of all the words in the sentence are fixed. This ensures the reliability of the POS tag in helping to extract and guide the corresponding semantics.
According to Merity et al. (Merity et al., 2016), predicting the next word does not always require attending to the visual feature. The gradients from non-visual words could mislead and diminish the effectiveness of the visual signal in guiding caption generation. To this end, Lu et al. (Lu et al., 2016) proposed a "visual sentinel", which applies the hidden state of the LSTM as supervision to adaptively decide whether it is necessary to input the visual feature to the language model when generating the next word. However, the supervision information from the hidden state is not reliable, since there is no guarantee that it contains the corresponding visual decision signals. According to the properties of the POS tag, the nouns and verbs can be applied as the supervision to distinguish whether the remaining words in the sentence are visual words or not.
In this paper, to explicitly improve the interpretability of the video captioning model, we propose an Adaptive Semantic Guidance Network (ASGN), which uses part-of-speech (POS) tags to instantiate the whole video semantics into different POS-aware semantics. First, a POS-aware semantic guider is proposed. It predicts the POS tags of the words in descriptions, and uses the predicted POS tags to guide different POS-aware semantic information of the video into corresponding network streams. In this process, specific CNN neurons are activated under the supervision of POS tags and the whole visual semantics are parsed into POS-tag-related video features. Moreover, depending on the POS-aware video features, a new adaptive attention operation is introduced. The video features related to noun and verb are used as supervision to obtain a sentinel gate, which decides how much of the attended feature is imported into the decoder LSTM when generating the next word. A reinforcement learning (RL) method is applied to optimize our model, which further demonstrates the validity of our method on the video captioning task.
The main contributions of this paper are:
- We design a POS-aware semantic guider to predict the POS tags of words and guide different semantic information of the video into corresponding network streams. Under the supervision of POS tags, the CNN neurons are selectively activated and become aware of the related POS tags, so that the whole video semantics are parsed into the corresponding POS-aware video features.
- Based on the video features instantiated by nouns and verbs, we construct a new adaptive attention model to decide how much of the visual feature is imported into the decoder.
- Owing to the guidance of POS tags, it is easy to clarify which type of semantics flows through which network stream. With the supervision of the noun and verb related features, the judgment of whether a predicted word is a visual word or not is more reasonable. These properties make the learning process more interpretable while achieving state-of-the-art performance.

Related Works
Here, we first review recent applications of the POS tag in computer vision, then review the works most relevant to the video description task, such as attention-based methods and models with improved interpretability.
The POS tag has received attention in some computer vision tasks, like visual question answering (Wang et al., 2018b) and image captioning (A et al., 2019). He et al. (He et al., 2017) utilized the POS tag of each word to determine whether it is essential to input the image representation into the word generator. Wang et al. (Wang et al., 2018b) exploited a POS tag guided attention model in VQA to put more emphasis on important words such as nouns, verbs and adjectives. All these methods realized the importance of the POS tag in language-related computer vision. However, they ignore an important property of the POS tag: different types of POS tags relate to different visual semantics.
Figure 2: The architecture of our Adaptive Semantic Guidance Network (ASGN), which can shunt the required semantic information into different video features when generating every word in the caption. The features $\{v_1, v_2, \dots, v_{N_s}\}$ are extracted by the CNN modules. $\odot$ is the Hadamard product module. $g_{na}$ is the new adaptive attention model. A single-layer LSTM is set as the POS tag decoder and a two-layer stacked LSTM is set as the language model. BOS and EOS denote the begin-of-sentence and end-of-sentence tokens, respectively.
Attention-based methods have been widely used in visual captioning models. Yao et al. (Yao et al., 2015) considered the temporal structure of video and proposed a temporal attention mechanism to generate descriptions. Lu et al. (Lu et al., 2016) proposed an adaptive attention model for image captioning which can decide either to look at the image or to rely on the context of the sentence to generate the next word.
Because of the high nonlinearity and unclear working mechanism of neural networks, their operational processes are usually treated as black boxes. For the video description task, some researchers (Wu et al., 2018) attempted to improve the interpretability of video description models. Dong et al. interpreted the learned features of each neuron by a wide range of visual concepts in the video description task. Wu et al. (Wu et al., 2018) considered both the motion information and the sentence semantic structure with an attentive structured localization mechanism to enhance the captioning model's interpretability.
In this paper, we find that the POS tag can be employed as supervision to process irrelevant and relevant semantic information in the video description task. Under the supervision of POS tags, an Adaptive Semantic Guidance Network (ASGN) is proposed to guide different POS-aware semantic information of the video into corresponding network streams. Moreover, the video features related to noun and verb are used to obtain a sentinel gate which decides how much of the attended feature is imported into the decoding process.

Our Method
In this section, we describe our ASGN in detail. First, the POS-aware semantic guider with the POS tag learning model is introduced. Next, an adaptive attention model which constructs a new sentinel gate will be presented. Finally, the description generator and its learning methods are introduced.

POS-Aware Semantic Guider
The key of the proposed POS-aware semantic guider is the POS tag, which is used to guide different semantic information of the video into corresponding network streams. Suppose that we have a video described by a textual sentence $S$, which consists of $T$ words. The set of POS tags of the words in $S$ is defined as $P$. To obtain these tags, a POS tag learning model is designed to predict the POS tags of the words in the description.
We refer to (Al-Rfou et al., 2013) and annotate the POS tags of captions in the training set with the polyglot toolkit. The polyglot toolkit is a natural language pipeline that supports massive multilingual applications, including POS tag identification. According to the setting of the polyglot toolkit, there are seventeen categories of POS tags: noun (NOUN), verb (VERB), adjective (ADJ), adposition (ADP), adverb (ADV), auxiliary verb (AUX), coordinating conjunction (CONJ), determiner (DET), interjection (INTJ), numeral (NUM), particle (PART), pronoun (PRON), proper noun (PROPN), punctuation (PUNCT), subordinating conjunction (SCONJ), symbol (SYM) and other unknown types (X). Different POS tags have different descriptions or embellishments. Following the universal set of this toolkit, the number of POS tag categories is defined as $N_s = 17$.
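As a small illustration, the seventeen categories can be turned into integer class labels for the POS tag decoder. The sketch below is an assumption about how such a mapping might look: the index order simply follows the order of the list above, the names `POS_TAGS`, `TAG_TO_INDEX` and `encode_tags` are hypothetical, and the example tags are written by hand rather than produced by the toolkit.

```python
# The seventeen POS tag categories, in the order they are listed in the text.
# The index assigned to each tag is an illustrative assumption.
POS_TAGS = [
    "NOUN", "VERB", "ADJ", "ADP", "ADV", "AUX", "CONJ", "DET", "INTJ",
    "NUM", "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "X",
]
TAG_TO_INDEX = {tag: index for index, tag in enumerate(POS_TAGS)}
N_S = len(POS_TAGS)  # number of POS tag categories


def encode_tags(tags):
    """Map a sequence of POS tag strings to integer class labels."""
    return [TAG_TO_INDEX[tag] for tag in tags]


# Hand-written tags for the caption "a man is lifting weights".
example = encode_tags(["DET", "NOUN", "AUX", "VERB", "NOUN"])
```

With such a mapping, each caption yields one label sequence that can serve as the ground truth for the POS tag learning model.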
Based on these POS tag categories, an encoder-decoder framework is proposed to predict the specific POS tag of each word in sentences, which can be seen in Fig. 2. Specifically, the video feature $V$ is concatenated from the extractions of the 2D CNN and 3D CNN. Then, $V$ flows into $N_s$ semantic guiding CNN modules. Each CNN module relates to a corresponding POS tag category. The outputs of these CNN modules are $N_s$ video features $v_i$, where $i = 1, \dots, N_s$. The mean-pooled feature $\bar{v} = \frac{1}{N_s} \sum_{i=1}^{N_s} v_i$ and the sentence $S = \{s_0, s_1, \dots, s_T\}$ are taken as the inputs to the POS tag decoder LSTM$_p$, where $s_0$ is defined as the begin-of-sentence (BOS) token. All the POS tags of the words in the caption are sequentially generated by LSTM$_p$. The hidden state $h^p_t$ of LSTM$_p$ is updated at each time step $t \in \{0, \dots, T\}$ from the embedded input word and the previous hidden state, where $W^p_e \in \mathbb{R}^{N_h \times V}$ is the word embedding matrix, $N_h$ is the length of the hidden state and $V$ denotes the vocabulary size of the corresponding dataset's text library.
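The mean pooling over the module outputs can be illustrated with a minimal sketch. It assumes that each of the module outputs has been flattened to a vector of the same length; the function name `mean_pool` is hypothetical and not taken from the original implementation.

```python
def mean_pool(features):
    """Average a list of equally sized feature vectors element-wise.

    `features` holds one vector per CNN module (one per POS tag category).
    """
    count = len(features)
    dim = len(features[0])
    return [sum(vector[j] for vector in features) / count for j in range(dim)]


# Two toy module outputs with two channels each.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

The pooled vector has the same size as a single module output, so the input size of the POS tag decoder does not depend on the number of modules.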
The ground-truth POS tags are annotated on the corresponding sentence, so the POS tags associated with the $k$-th video $V^k$ are $P^k = \{p^k_0, \dots, p^k_T\}$. The POS tag learning loss is computed against these ground-truth tags over all $N$ training examples.
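One common way to realize a loss of this kind is the negative log-likelihood of the ground-truth tags, averaged over the training examples. The sketch below assumes that form, which the text does not confirm; the function name `pos_tag_nll` and the toy probabilities are hypothetical.

```python
import math


def pos_tag_nll(predicted_probs, gold_tags):
    """Negative log-likelihood of gold POS tags, averaged over examples.

    predicted_probs: for each example, a list of per-step probability vectors.
    gold_tags: for each example, a list of gold tag indices (one per step).
    """
    total = 0.0
    for step_probs, tags in zip(predicted_probs, gold_tags):
        for probs, tag in zip(step_probs, tags):
            total -= math.log(probs[tag])
    return total / len(predicted_probs)


# One training example with two time steps and two tag categories.
loss = pos_tag_nll([[[0.5, 0.5], [0.25, 0.75]]], [[0, 1]])
```

The loss is zero only when every gold tag receives probability one, and it grows as probability mass moves to the wrong categories.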
In the encoder-decoder framework, the POS tag probability vector $p_t$ is predicted at time step $t$ by a mapping that combines a Softmax function with FC, a fully connected layer. The predicted POS tag is the one with the maximum probability in $p_t$. Similar to visual captioning, at test time the learned POS-aware semantic guider uses each predicted word to predict the POS tag of the next word.
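The selection of the predicted tag can be sketched as follows, assuming the fully connected layer has already produced one score per POS tag category. The function names `softmax` and `predict_tag` are hypothetical helpers written for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict_tag(scores):
    """Return the index of the most probable tag and the probability vector."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy scores for three tag categories.
best_index, tag_probs = predict_tag([0.1, 2.0, -1.0])
```

Here `tag_probs` plays the role of the probability vector at one time step and `best_index` is the predicted POS tag.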
The learned $p_t$ is used to guide the flow of semantics from the $N_s$ CNN modules at time step $t$. The $i$-th CNN module is related to the $i$-th POS tag category. The outputs of the CNN modules are concatenated, where $v_i$ is generated by the specific neurons of the $i$-th CNN module. The POS tag vector $p_t$ is applied to activate the specific CNN module at the $t$-th step. The operation is implemented by a Hadamard product module at the channel level, which yields $V^r_t$, the video representation at time step $t$. The video feature $V^r_t$, which contains the relevant visual semantic information, is then input into the language model at time step $t$.
In general, the learned POS tag representations can be used to activate the specific neurons of the CNN encoder and to parse the whole video representation, guiding semantic information into the corresponding feature representations at each time step. Compared with the normal CNN+LSTM model, our POS-aware semantic guider constructs a correlation between different types of POS-aware semantics and the corresponding CNN modules.
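The channel-level gating can be pictured with a short sketch. It assumes that each entry of the POS tag vector scales all channels of its own CNN module before the module outputs are concatenated; this broadcasting scheme and the function name `gate_by_pos` are assumptions made for illustration.

```python
def gate_by_pos(module_features, pos_probs):
    """Scale every channel of module i by pos_probs[i], then concatenate.

    module_features: one channel vector per CNN module.
    pos_probs: one probability per POS tag category (same length).
    """
    gated = []
    for channels, prob in zip(module_features, pos_probs):
        gated.extend(prob * value for value in channels)
    return gated


# Two modules with two channels each; the second module is switched off.
representation = gate_by_pos([[1.0, 2.0], [3.0, 4.0]], [0.5, 0.0])
```

A module whose POS tag has probability close to zero contributes almost nothing to the representation at that step, which is what makes the active stream easy to read off.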

Adaptive Attention Model
Although Lu et al. (Lu et al., 2016) proposed a sentinel to decide whether the predicted words are visual words or not, their model learns from the gradient of back propagation, which is still an ambiguous process. The video features related to different POS tags have different properties. For example, the video features related to noun and verb always have sufficient visual signals. According to the properties of POS tags, we propose a more credible adaptive spatial attention model to predict the next word. Specifically, the video features related to noun and verb are applied as the supervision to obtain a sentinel gate. Through the sentinel gate, when generating the next word, we can distinguish whether it is a visual word or not and decide how much of the attended feature is imported into the decoder LSTM.
First, the video feature $V^r_t$ is reshaped to $V^r_t = [v^r_{t1}, v^r_{t2}, \dots, v^r_{tm}]$, where $m$ is the width times the height of $V^r_t$. The normal attention model is defined as $g_a(V^r_t, h^d_{t-1})$, where $h^d_{t-1}$ is generated by the description decoder LSTM$_d$ shown in Fig. 2. For the $t$-th time step, the attention part of the model $g_a$ uses the following parameters: $W_r \in \mathbb{R}^{l \times (N_c N_s)}$, $W_{hr} \in \mathbb{R}^{l \times N_h}$ and $W_{\alpha r} \in \mathbb{R}^{l \times 1}$ are the transformation matrices that map the CNN feature and the hidden state to the same size; $N_c$ is the channel size of $v_i$; $b_r \in \mathbb{R}^{l}$ and $b_{\alpha r} \in \mathbb{R}^{1}$ are the model biases. $\alpha^r_t$ is the attention weight related to $V^r_t$.
Second, to improve the structure of $g_a$, our model learns to use the video features related to noun and verb as supervision to distinguish whether the words in a sentence are visual words or not. All of the $N_s$ video features are extracted after the ReLU operation, and thus the values in these feature maps are not less than zero. This indicates that the value of a feature map can reflect the response activation of the corresponding CNN module. We take the feature maps related to noun and verb as the reference and obtain a value from $v_n$ and $v_v$, the features of $V^r$ related to noun and verb. Depending on the attended visual feature, a formula is introduced to ascertain whether the word at the current time step $t$ is a visual word or not; it yields $c^v_t$, the weight of the visual word at time step $t$, and uses $p^i_t$, the $i$-th value in the $t$-th POS tag representation $p_t$. The weight of the non-visual word is therefore $c^{nv}_t = 1 - c^v_t$. In our method, the sentinel gate is denoted $\beta_t$.
The design of the sentinel gate prevents the gate value $\beta_t$ from being too small or too large. Based on $\beta_t$, the new adaptive attention model $g_{na}$ decides how much of the attended feature is imported into the decoder LSTM.
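The parameter shapes listed for the attention part match those of standard additive soft attention. As a hedged sketch, assuming that standard form (a tanh projection of each region feature and the previous hidden state, followed by a softmax over the $m$ regions), the attention weights and the attended feature could be computed as below. The function names are hypothetical, and the sentinel gate itself is not reproduced here because its definition is not available in the text.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(regions, h_prev, W_r, W_hr, b_r, w_alpha, b_alpha):
    """Standard additive attention over m region vectors.

    Returns the attention weights and the weighted sum of the regions.
    """
    h_proj = matvec(W_hr, h_prev)
    scores = []
    for region in regions:
        r_proj = matvec(W_r, region)
        hidden = [math.tanh(a + b + c) for a, b, c in zip(r_proj, h_proj, b_r)]
        scores.append(sum(w * z for w, z in zip(w_alpha, hidden)) + b_alpha)
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    dim = len(regions[0])
    attended = [sum(a * region[j] for a, region in zip(weights, regions))
                for j in range(dim)]
    return weights, attended


# With all-zero parameters every region gets the same weight.
weights, attended = additive_attention(
    regions=[[1.0, 0.0], [0.0, 1.0]], h_prev=[0.0],
    W_r=[[0.0, 0.0]], W_hr=[[0.0]], b_r=[0.0], w_alpha=[0.0], b_alpha=0.0)
```

In an adaptive variant, a gate value would then control how much of `attended` reaches the decoder.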

Description Generator
In the description generation stage, we adopt a stacked two-layer LSTM to generate captions, namely LSTM$_w$ and LSTM$_d$. Following (Donahue et al., 2015), the first layer LSTM$_w$ encodes the input sentence to enhance the textual context information of each word vector $h^w_t$. In the decoding stage, the encoded word vector $h^w_t$ and the processed video feature $a^r_t$ are taken as the inputs at time step $t$. LSTM$_d$ is updated from step 0 to $T$, where $h^d_t$ is the current output of LSTM$_d$ at time step $t$ and $h^d_{-1}$ can be set as a null vector.

Description Generator Learning
Maximum Likelihood Estimation: Given the $k$-th video $V^k$ and the associated sentence $W^k = \{w^k_0, \dots, w^k_T\}$, the generator loss is formulated with maximum likelihood estimation (MLE), where $p(w^k_t \mid w^k_{0:t-1}, V^k)$ is obtained from a Softmax mixed function and $\lambda$ is a hyper-parameter.
Policy Gradient Optimization: For a fair comparison with recent works, the policy gradient (PG) technique is adopted to train our model. The objective in learning is to minimize the negative expected reward of the complete sampled sentence $W^s = \{w^s_0, \dots, w^s_T\}$, where $r(W^s)$ is calculated by comparing the sampled caption with the reference caption under the specified evaluation metric. Following the implementation in (Rennie et al., 2017), we apply a single Monte-Carlo sample to calculate the relative reward $\Delta r(W^s)$, which is computed against a baseline reward $b$ obtained by performing greedy decoding.
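The single-sample estimate with a greedy baseline can be sketched as a surrogate loss. The sketch assumes the usual self-critical form, in which the relative reward is the sampled reward minus the greedy reward and multiplies the log-probability of the sampled sentence; the function name `self_critical_loss` and the toy numbers are hypothetical.

```python
def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """Surrogate loss for one sampled caption with a greedy-decoding baseline.

    sample_log_probs: log-probability of each sampled word.
    sample_reward: metric score of the sampled caption.
    greedy_reward: metric score of the greedily decoded caption (baseline).
    """
    relative_reward = sample_reward - greedy_reward
    return -relative_reward * sum(sample_log_probs)


# A sample that beats the greedy caption gives a positive loss value here,
# and minimizing it raises the log-probability of the sampled words.
loss_value = self_critical_loss([-0.5, -0.5], sample_reward=1.0, greedy_reward=0.4)
```

When the sampled caption scores below the greedy one, the relative reward is negative and the same update lowers the probability of that sample.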

Dataset and Evaluation
We report the results of our method on the Youtube2Text (Guadarrama et al., 2013) and MSR-VTT datasets. The Youtube2Text dataset contains 1970 YouTube video clips. According to the publicly provided splits (Venugopalan et al., 2015b), 1200 videos are used for training, 100 videos for validation and the rest for testing. MSR-VTT is the largest public dataset for video captioning to date. We follow the public splits (Venugopalan et al., 2015a) and divide the videos into 6,513, 497 and 2,990 samples for training, validation and testing, respectively. We keep the words that appear in the training set, yielding two vocabularies that contain 12,182 and 16,630 words for the Youtube2Text and MSR-VTT datasets, respectively.

Training Details
CNN Encoder: For the video representations, we use a 2D CNN and a 3D CNN together as the CNN encoder. The 3D CNN operates on all video frames as a whole, which ensures that the extracted visual features contain all the semantic information. The 2D CNN has a more efficient learning and representation capacity. The details of these two CNNs are as follows:
• 2D CNN: We use ResNet-152 as the 2D CNN model. The feature map is taken from the res5c layer (2,048-dim).
For Youtube2Text and MSR-VTT, 16 and 32 equally spaced frames are sampled from each video clip, respectively. We perform a mean operation over all the 2D CNN features. The representation $V$ of each video is formed by concatenating the 3D CNN feature and the 2D CNN feature. Then, the feature map $V$ is processed by $N_s$ semantic separating CNN modules. A residual block is adopted as the CNN module in our method. The hidden state dimension of the LSTM units is 1,024.
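The equally spaced frame sampling mentioned above can be illustrated with a minimal sketch. The exact sampling scheme is not specified in the text, so picking the centre of each of the equal-length segments is an assumption, and the function name `equally_spaced_indices` is hypothetical.

```python
def equally_spaced_indices(num_frames, num_samples):
    """Pick frame indices spread evenly across a clip.

    The clip is split into num_samples equal segments and the frame at the
    centre of each segment is returned. Short clips return every frame.
    """
    if num_samples >= num_frames:
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


# A 160-frame clip sampled at 16 frames, as for Youtube2Text.
indices = equally_spaced_indices(160, 16)
```

The same helper with `num_samples=32` would cover the MSR-VTT setting.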
The Adam optimizer is adopted in training. We first train the POS tag learning model in the POS-aware semantic guider. In the later training, the parameters of the POS tag learning model are fixed. The other parameters of our model are learned with MLE. $\lambda$ is set to 1. The maximum number of epochs of the MLE training is 30. The RL method is then applied to optimize the MLE-trained model with the CIDEr metric. At each epoch, the validation set is used to evaluate the model being trained, and the model with the best CIDEr score is selected for the final testing. All of our experiments are implemented with PyTorch (Paszke et al., 2017) and are conducted on a Titan X GPU with 12G memory.
In caption testing, beam search is adopted for caption generation, with the beam size set to 5 in the experiments.
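For readers unfamiliar with beam search, the sketch below shows a generic version with a beam of 5 hypotheses. It is a toy illustration and not the decoding code of the model: `step_fn` stands in for the captioning network, the word table is invented, and no length normalization is applied.

```python
import math


def beam_search(step_fn, bos, eos, beam_size=5, max_len=20):
    """Generic beam search; step_fn(prefix) maps next tokens to probabilities."""
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, prob in step_fn(prefix).items():
                if prob > 0.0:
                    candidates.append((prefix + [token], score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda item: item[1])[0]


def toy_step(prefix):
    """Invented next-word table that depends only on the last token."""
    table = {
        "<BOS>": {"a": 0.9, "the": 0.1},
        "a": {"man": 0.6, "girl": 0.4},
        "the": {"man": 1.0},
        "man": {"<EOS>": 1.0},
        "girl": {"<EOS>": 1.0},
    }
    return table[prefix[-1]]


best_caption = beam_search(toy_step, "<BOS>", "<EOS>", beam_size=5)
```

Setting `beam_size=1` reduces the procedure to greedy decoding.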

Ablation Study
We perform ablation studies on the Youtube2Text and MSR-VTT datasets for our video captioning model. The results are shown in Table 1. ASGN, which imports the video feature into different network streams without POS tag guidance, is adopted as the baseline model. ASGN+L predicts the POS tag of each word and applies the POS tag to guide the semantic separation, where "L" denotes the POS-aware semantic guider. ASGN+LA adds the attention model proposed by Lu et al. (Lu et al., 2016). As a comparison, our proposed new adaptive attention model is introduced in ASGN+LNA. ASGN+LNA achieves the best performance on all metrics, which indicates that our proposed sentinel gate is a more effective and reasonable way to decide how much of the attended feature is passed to the decoder LSTM. Compared with the baseline model, the introduction of the POS tag in ASGN+L, which activates the specific neurons and parses the whole visual representation of the video, assigns appropriate POS-aware semantic information and achieves better performance. Comparing the MLE-based and RL-based results, the RL method improves the performance of the MLE-based model by significant margins across all metrics.

Quantitative Analysis
In Table 2 and Table 3, we compare our ASGN+LNA model with the state-of-the-art models on the Youtube2Text and MSR-VTT datasets. Following (Gan et al., 2016; Pasunuru and Bansal, 2017), ASGN+LNA is the average ensemble of 5 ASGN+LNA (RL) models trained with different initializations. The results show that our method achieves competitive performance on the two datasets. Compared with other methods that improve interpretability (Wu et al., 2018), the interpretability of our neural network is explicitly improved, and the performance of our model is more competitive.
Figure 4: Visualizations of the probabilities of each word and the corresponding POS tag in the caption.
We adopt the human evaluation from (Pasunuru and Bansal, 2017) to compare the ASGN and ASGN+L models. Relevance, which measures how related the generated caption is to the video content, is adopted as the metric. Table 4 reports the results for 150 samples from the Youtube2Text test set. It can be seen that the proposed semantic guider significantly improves the semantic extraction ability of the network.
Moreover, to better verify the reliability of the supervision of nouns and verbs, we add comparisons in which the supervision is changed to nouns only (N), verbs only (V), adjectives and adverbs (A+A), and nouns and verbs (N+V), respectively. The results on Youtube2Text are presented in Table 5. The model under the supervision of N+V achieves the best performance. Supervision by nouns is more reliable than supervision by verbs. The results under the supervision of A+A indicate that adjectives and adverbs are not always related to visual signals.

Visualized Analysis
To examine the reliability of the POS tag prediction, the performance of the POS tag learning model is measured using accuracy (Acc) and B-N (N=2,3,4) on the test sets of Youtube2Text and MSR-VTT. The results are shown in Fig. 3. The Acc metric tests the overall prediction performance and the B-N metric tests the continuity of the prediction. To better illustrate the effectiveness of the model, we set the beam search size to 1, which is the same as in the training process. These results indicate that the POS tag learning model can provide reliable POS tag representations to guide the separation of semantics into the corresponding network streams. Fig. 4 gives some generated captions and the corresponding probabilities of words and POS tags. In Fig. 4, we can see that the POS-aware semantic guider successfully guides the POS-aware semantics to the generation of captions. The POS-aware neurons with higher probability are activated to extract the corresponding semantics and predict the relevant words at each time step. In the first example, the POS tags of the generated words with the highest probability have high probability as well. This demonstrates the improved interpretability of our model.
To further present the interpretability of our model, the neuron activations associated with the POS tags, the weights of the sentinel gate, the generated video captions, and the real POS tags of the captions are visualized in Fig. 5. Each element of these samples is illustrated along with the word prediction in sequence. Through these examples, the process of extracting the POS-aware semantic information and guiding it into the corresponding network stream can be visualized explicitly. Comparing the activated neurons with the true POS tags of the captions, it can be seen that the corresponding POS-aware neurons have high activations through time. For example, in Fig. 5 (b), "DET", "NOUN" and "VERB" point precisely to the POS tags of the words in "a man is flying into the water". This illustrates that the neurons corresponding to the POS tags are successfully activated at each time step. Moreover, comparing the weights of the sentinel gate with the generated captions reveals that our adaptive attention model can effectively capture both the visual words and the non-visual words.
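The two kinds of measurement can be sketched for a single tag sequence. The sketch assumes that Acc is position-wise accuracy and that B-N is a BLEU-style clipped n-gram precision; the text does not define B-N precisely, so this reading and the function names are assumptions.

```python
from collections import Counter


def tag_accuracy(predicted, gold):
    """Fraction of positions where the predicted tag equals the gold tag."""
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


def ngram_precision(predicted, gold, n):
    """Clipped n-gram precision of the predicted tag sequence."""
    pred_counts = Counter(tuple(predicted[i:i + n])
                          for i in range(len(predicted) - n + 1))
    gold_counts = Counter(tuple(gold[i:i + n])
                          for i in range(len(gold) - n + 1))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, gold_counts[gram])
                  for gram, count in pred_counts.items())
    return overlap / total


predicted_tags = ["DET", "NOUN", "AUX", "VERB", "NOUN"]
gold_tags = ["DET", "NOUN", "AUX", "VERB", "ADP"]
accuracy = tag_accuracy(predicted_tags, gold_tags)
bigram_precision = ngram_precision(predicted_tags, gold_tags, 2)
```

A single wrong tag lowers the accuracy once but breaks every n-gram that contains it, which is why n-gram scores are sensitive to the continuity of the prediction.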

Description: A person is slicing a potato
From Fig. 5 (a), the words "person", "slicing" and "potato" are obviously visual words which have related visual signals in the video, and our model successfully extracts their visual information.

Conclusion
In this paper, we have proposed an Adaptive Semantic Guidance Network (ASGN), which extracts the POS-aware semantic information and guides it into the corresponding encoded visual representations under the supervision of POS tags. Moreover, a new sentinel gate is introduced to determine how much of the attended feature is imported into the decoding process. Our results indicate that improving interpretability not only makes the learning process more transparent, but also gives the model more room to explore. The promising performance and improved interpretability of our method demonstrate the effectiveness of the POS tag. Furthermore, the proposed ASGN is flexible and can be applied to other language and vision tasks, such as image captioning and visual question answering.