Video-aided Unsupervised Grammar Induction

We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video. Existing methods of multi-modal grammar induction focus on grammar induction from text-image pairs, with promising results showing that the information from static images is useful in induction. However, videos provide even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases. In this paper, we explore rich features (e.g. action, object, scene, audio, face, OCR and speech) from videos, taking the recent Compound PCFG model as the baseline. We further propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities. Our proposed MMC-PCFG is trained end-to-end and outperforms each individual modality and previous state-of-the-art systems on three benchmarks, i.e. DiDeMo, YouCook2 and MSRVTT, confirming the effectiveness of leveraging video information for unsupervised grammar induction.


Introduction
Constituency parsing is an important task in natural language processing, which aims to capture the syntactic structure of sentences in the form of constituency parse trees. Many conventional approaches learn a constituency parser from human-annotated datasets such as the Penn Treebank (Marcus et al., 1993). However, annotating syntactic trees by human language experts is expensive and time-consuming, and such supervised approaches are limited to several major languages. In addition, the treebanks for training these supervised parsers are small in size and restricted to the newswire domain, so their performance tends to be worse when applied to other domains (Fried et al., 2019). To address these issues, recent approaches (Shen et al., 2018b; Jin et al., 2018; Drozdov et al., 2019; Kim et al., 2019) design unsupervised constituency parsers and grammar inducers, since they can be trained on large-scale unlabeled data. In particular, there has been growing interest in exploiting visual information for unsupervised grammar induction, because visual information can capture important knowledge required for language learning that is absent from text (Gleitman, 1990; Pinker and MacWhinney, 1987; Tomasello, 2003). This task aims to learn a constituency parser from raw unlabeled text aided by its visual context.

Figure 1: Examples of video-aided unsupervised grammar induction. We aim to improve the constituency parser by leveraging aligned video-sentence pairs.
Previous methods (Shi et al., 2019; Kojima et al., 2020; Zhao and Titov, 2020; Jin and Schuler, 2020) learn to parse sentences by exploiting object information from images. However, images are static and cannot present the dynamic interactions among visual objects, which usually correspond to verb phrases that carry important information. Therefore, images and their descriptions may not be fully representative of all linguistic phenomena encountered in learning, especially when action verbs are involved. For example, as shown in Figure 1(a), when parsing the sentence "A squirrel jumps on stump", a single image cannot present the verb phrase "jumps on stump" accurately. Moreover, as shown in Figure 1(b), the guitar sound and the moving fingers clearly indicate the speed of the music playing, which cannot be conveyed by a static image alone. Therefore, it is difficult for previous methods to learn these constituents, as the static images they consider lack dynamic visual and audio information.
In this paper, we address this problem by leveraging video content to improve an unsupervised grammar induction model. In particular, we exploit current state-of-the-art techniques in both video and audio understanding, covering object, motion, scene, face, optical character, sound, and speech recognition. We extract features from the corresponding state-of-the-art models and analyze their usefulness with the VC-PCFG model (Zhao and Titov, 2020). Since different modalities may correlate with each other, modeling each of them independently may be sub-optimal. We therefore propose a novel model, Multi-Modal Compound Probabilistic Context-Free Grammars (MMC-PCFG), to better model the correlation among these modalities.
Experiments on three benchmarks show substantial improvements when using each modality of the video content. Moreover, our MMC-PCFG model, which integrates information from different modalities, further improves the overall performance. Our code is available at https://github.com/Sy-Zhang/MMC-PCFG.
The main contributions of this paper are:
• We are the first to address video-aided unsupervised grammar induction, and we demonstrate that verb-related features extracted from videos are beneficial to parsing.
• We perform a thorough analysis on different modalities of video content and propose a model to effectively integrate these important modalities to train better constituency parsers.
• Experimental results demonstrate the effectiveness of our model over previous state-of-the-art methods.

Background and Motivation
Our model is motivated by C-PCFG (Kim et al., 2019) and its image-aided variant, VC-PCFG (Zhao and Titov, 2020). We first review these two frameworks in Sections 2.1-2.2, and then discuss their limitations in Section 2.3.

Compound PCFGs
A probabilistic context-free grammar (PCFG) in Chomsky normal form can be defined as a 6-tuple (S, N, P, Σ, R, Π), where S is the start symbol, and N, P and Σ are the sets of nonterminals, preterminals and terminals, respectively. R is a set of production rules with their probabilities stored in Π, where the rules include binary nonterminal expansions and unary terminal expansions. Given a certain number of nonterminal and preterminal categories, a PCFG induction model tries to estimate the rule probabilities. By imposing a sentence-specific prior on the distribution of possible PCFGs, the compound PCFG model (Kim et al., 2019) uses a mixture of PCFGs to model individual sentences, in contrast to previous models (Jin et al., 2018) where a corpus-level prior is used. Specifically, in the generative story, the rule probability π_r is estimated by the model g with a latent representation z for each sentence σ, which is in turn drawn from a prior p(z):

π_r = g_r(z; θ), with z ∼ p(z). (1)

The probabilities for the CFG initial expansion rules S → A, nonterminal expansion rules A → B C and preterminal expansion rules T → w can be estimated by calculating scores of each combination of a parent category on the left-hand side of a rule and all possible child categories on the right-hand side:

π_{S→A} ∝ exp(u_A^⊤ f_s(w_S; z)), π_{A→BC} ∝ exp(u_{BC}^⊤ [w_A; z]), π_{T→w} ∝ exp(u_w^⊤ f_t(w_T; z)), (2)

where A, B, C ∈ N, T ∈ P, w ∈ Σ, the normalization is over all possible right-hand sides, w and u are vectorial representations of words and categories, and f_t and f_s are encoding functions such as neural networks.
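As an illustration of this parameterization, the following toy sketch (our own simplification; the actual scoring functions are neural networks, and the shapes here are illustrative) scores every child pair against the parent embedding concatenated with the sentence latent z, then normalizes with a softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nt, dim = 4, 8                       # nonterminal categories, embedding size
z = rng.normal(size=dim)               # per-sentence latent from the compound prior
u = rng.normal(size=(n_nt, dim))       # category embeddings

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def binary_rule_probs(parent):
    """pi_{A -> B C} over all n_nt**2 child pairs, conditioned on z."""
    parent_vec = np.concatenate([u[parent], z])            # [u_A; z]
    child_vecs = np.stack([np.concatenate([u[b], u[c]])    # [u_B; u_C]
                           for b in range(n_nt) for c in range(n_nt)])
    return softmax(child_vecs @ parent_vec)

pi = binary_rule_probs(0)              # a valid distribution over expansions of A_0
```

Because z is resampled per sentence, the same category can expand with different probabilities in different sentences, which is the key difference from a vanilla PCFG.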
Optimization of the PCFG induction model usually involves maximizing the marginal likelihood of a training sentence p(σ) for all sentences in a corpus. In the case of compound PCFGs:

log p_θ(σ) = log ∫ Σ_{t∈T_G(σ)} p_θ(t|z) p(z) dz, (3)

where t is a possible binary-branching parse tree of σ among all possible trees T_G(σ) under a grammar G. Since computing the integral over z is intractable, log p_θ(σ) can be optimized by maximizing its evidence lower bound:

ELBO(σ; φ, θ) = E_{q_φ(z|σ)}[log p_θ(σ|z)] − KL(q_φ(z|σ) || p(z)), (4)

where q_φ(z|σ) is a variational posterior, a neural network parameterized by φ. The sample log-likelihood can be computed with the inside algorithm, while the KL term can be computed analytically when both the prior p(z) and the posterior approximation q_φ(z|σ) are Gaussian (Kingma and Welling, 2014).
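The analytic KL term can be sketched as follows (a standard result, assuming a standard normal prior and a diagonal Gaussian posterior parameterized by a mean and a log-variance):

```python
import numpy as np

# With p(z) = N(0, I) and q(z|sigma) = N(mu, diag(exp(logvar))),
# KL(q || p) has the usual closed form (Kingma and Welling, 2014).
def kl_diag_gaussian(mu, logvar):
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

mu = np.array([0.5, -0.3])
logvar = np.zeros(2)                 # unit variance
kl = kl_diag_gaussian(mu, logvar)    # 0.5 * (0.25 + 0.09) = 0.17
```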

Visually Grounded Compound PCFGs
The visually grounded compound PCFGs model (VC-PCFG) extends the compound PCFG model (C-PCFG) by including a matching model between images and text. The goal of the vision model is to match the representation of an image v to the representation of a span c in a parse tree t of a sentence σ. The word representation h_i for the i-th word is calculated by a BiLSTM network. Given a particular span c = (w_i, ..., w_j) (0 < i < j ≤ n), we then compute its representation c. We first compute the probabilities of its phrasal labels {p(k|c, σ) | 1 ≤ k ≤ K}, K = |N|, as described in Section 2.1. The representation c is the sum of all label-specific span representations weighted by the predicted probabilities:

c = Σ_{k=1}^{K} p(k|c, σ) c_k, (5)

where c_k is the label-specific representation of the span. Finally, the matching loss between a sentence σ and an image representation v can be calculated as a sum over all matching losses between a span and the image representation, weighted by the marginal p(c|σ) of a span from the parser:

s_img(v, σ) = Σ_{c∈σ} p(c|σ) h_img(c, v), (6)

where h_img(c, v) is a hinge loss between the distances from the image representation v to the matching and unmatching (i.e., sampled from a different sentence) spans c and c′, and the distances from the span c to the matching and unmatching (i.e., sampled from a different image) image representations v and v′:

h_img(c, v) = E_{c′}[cos(c′, v) − cos(c, v) + ε]_+ + E_{v′}[cos(c, v′) − cos(c, v) + ε]_+, (7)

where ε is a positive margin, [·]_+ = max(0, ·), and the expectations are approximated with one sample drawn from the training data. During training, the ELBO and the image-text matching loss are jointly optimized.
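The span-image hinge loss can be sketched as follows (a minimal version of our own: cosine similarity, a single sampled negative per side, and all variable names are our assumptions consistent with the description above):

```python
import numpy as np

def hinge(c, v, c_neg, v_neg, eps=0.2):
    """Hinge loss for one matching span/image pair with one negative each."""
    def sim(a, b):                     # cosine similarity
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return (max(0.0, eps + sim(c_neg, v) - sim(c, v)) +   # bad span vs. image
            max(0.0, eps + sim(c, v_neg) - sim(c, v)))    # span vs. bad image

c = np.array([1.0, 0.0])               # span embedding
v = np.array([1.0, 0.1])               # matching image embedding
c_neg = np.array([0.0, 1.0])           # span from a different sentence
v_neg = np.array([-1.0, 0.0])          # image from a different pair
loss = hinge(c, v, c_neg, v_neg)       # 0.0: positives beat negatives by > eps
```

When the negatives are as similar to the image as the true span, the margin is violated and the loss becomes positive, pushing matched pairs together.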

Limitation
VC-PCFG improves over C-PCFG by leveraging the visual information from paired images. In their experiments (Zhao and Titov, 2020), compared to C-PCFG, the largest improvement comes from NPs (+11.9% recall), while recall values of other frequent phrase types (VP, PP, SBAR, ADJP and ADVP) are fairly similar. The performance gain on NPs is also observed with another multi-modal induction model, VG-NSL (Shi et al., 2019; Kojima et al., 2020). Intuitively, image representations from image encoders trained on classification tasks very likely contain accurate information about objects in images, which is most relevant to identifying NPs 1 . However, they provide limited information for phrase types that mainly involve action and change, such as verb phrases. Representations of dynamic scenes may help the induction model identify verbs, and they also contain information about the argument structure of verbs and nouns, based on features of actions and participants extracted from videos. Therefore, we propose a model that induces PCFGs from raw text aided by multi-modal information extracted from videos, and we expect accuracy gains on these phrase types in comparison to the baseline systems.
Multi-Modal Compound PCFGs

Instead of purely relying on object information from images, we generalize VC-PCFG to the video domain, where multi-modal video information is considered. We first introduce the video representation in Section 3.1. We then describe the procedure for matching the multi-modal video representation with each span in Section 3.2. After that, we introduce the training and inference details in Section 3.3.

Video Representation
A video contains a sequence of frames, denoted as V = {v_i}_{i=1}^{L_0}, where v_i represents a frame in the video and L_0 indicates the total number of frames. We extract video representations from M models trained on different tasks, which are called experts. Each expert focuses on extracting a sequence of features of one type. In order to project different expert features into the same dimension, their feature sequences are fed into linear layers (one per expert) with the same output dimension. We denote the outputs of the m-th expert after projection as F^m = {f_i^m}_{i=1}^{L_m}, where f_i^m and L_m represent the i-th feature and the total number of features of the m-th expert, respectively.
A simple method would average each feature sequence along the temporal dimension and then concatenate the results. However, this would ignore the relations among different modalities and the temporal ordering within each modality. In this paper, we instead use a multi-modal transformer to collect video representations (Lei et al., 2020).
The multi-modal transformer expects a sequence as input, hence we concatenate all feature sequences together, prepending each expert's temporally averaged feature f̄^m to its sequence:

X = [f̄^1, f_1^1, ..., f_{L_1}^1, ..., f̄^M, f_1^M, ..., f_{L_M}^M]. (8)

Each transformer layer has a standard architecture and consists of a multi-head self-attention module and a feed-forward network (FFN). Since this architecture is permutation-invariant, we supplement it with expert type embeddings E and positional encodings P that are added to the input of each attention layer. The expert type embeddings indicate the expert type of each input feature and take the form:

E = [e_1, e_1, ..., e_1, ..., e_M, e_M, ..., e_M], (9)

where e_m is a learned embedding for the m-th expert. The positional encodings indicate the location of each feature within the video and take the form:

P = [p_1, p_2, ..., p_N], (10)

where fixed sinusoidal encodings are used (Vaswani et al., 2017). After that, we collect the outputs of the transformer that correspond to the averaged features as the final video representation, i.e., Ψ = {ψ_avg^m}_{m=1}^M. In this way, we can learn a more effective video representation by modeling the correlations of features from different modalities and different timestamps.
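The input assembly described above can be sketched as follows (a two-expert toy setup of our own; the per-expert projection layers and the averaged-feature tokens are omitted, and the sinusoidal encoding follows Vaswani et al., 2017):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Per-expert feature sequences F^m of shape (L_m, d), already projected.
experts = {"object": rng.normal(size=(5, d)),
           "action": rng.normal(size=(3, d))}

def positional_encoding(n, d):
    """Fixed sinusoidal encodings: sin on even dims, cos on odd dims."""
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# One learned type embedding e_m per expert, added to every token of
# that expert; positions restart within each expert's sequence.
type_emb = {name: rng.normal(size=d) for name in experts}
tokens = np.concatenate(
    [feat + type_emb[name] + positional_encoding(len(feat), d)
     for name, feat in experts.items()], axis=0)   # (sum L_m, d)
```

Without E and P the self-attention layers could not tell an object token at frame 1 from an action token at frame 3, since attention is permutation-invariant.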

Video-Text Matching
To compute the similarity between a video V and a particular span c, a span representation c is obtained following Section 2.2 and projected to M separate expert embeddings via gated embedding modules (one per expert) (Miech et al., 2018):

c_i^1 = W_i^1 c + b_i^1, c_i^2 = c_i^1 ∘ sigmoid(W_i^2 c_i^1 + b_i^2), ξ_i = c_i^2 / ||c_i^2||_2, (11)

where i is the index of the expert, W_i^1, W_i^2, b_i^1 and b_i^2 are learnable parameters, sigmoid is an element-wise sigmoid activation and ∘ is element-wise multiplication. We denote the set of expert embeddings of span c as Ξ = {ξ_i}_{i=1}^M. The video-span similarity is computed as follows:

s_vid(V, c) = Σ_{i=1}^M w_i(c) cos(ψ_avg^i, ξ_i), with w_i(c) = exp(u_i^⊤ c) / Σ_{j=1}^M exp(u_j^⊤ c), (12)

where {u_i}_{i=1}^M are learned weights. Given Ξ′, the expert embeddings of an unmatched span for Ψ, and Ψ′, an unmatched video representation for Ξ, the hinge loss for video is given by:

h_vid(c, V) = [s_vid(Ψ, Ξ′) − s_vid(Ψ, Ξ) + ε]_+ + [s_vid(Ψ′, Ξ) − s_vid(Ψ, Ξ) + ε]_+, (13)

where ε is a positive margin. Finally, the video-text matching loss is defined as:

s_vid(V, σ) = Σ_{c∈σ} p(c|σ) h_vid(c, V). (14)

Note that s_vid can be regarded as a generalized form of s_img in Equation 6, where features from different timestamps and modalities are considered.
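One gated embedding module can be sketched as follows (random weights stand in for the learned parameters W_i^1, b_i^1, W_i^2, b_i^2; dimensions are illustrative):

```python
import numpy as np

def gated_embedding(c, W1, b1, W2, b2):
    """Project c, gate it element-wise, and L2-normalize (Miech et al., 2018)."""
    x1 = W1 @ c + b1                                # linear projection
    gate = 1.0 / (1.0 + np.exp(-(W2 @ x1 + b2)))    # element-wise sigmoid
    out = x1 * gate                                 # element-wise gating
    return out / (np.linalg.norm(out) + 1e-8)       # unit length for cosine sim

rng = np.random.default_rng(0)
d_in, d_out = 6, 4
xi = gated_embedding(rng.normal(size=d_in),
                     rng.normal(size=(d_out, d_in)), rng.normal(size=d_out),
                     rng.normal(size=(d_out, d_out)), rng.normal(size=d_out))
```

The L2 normalization means the subsequent dot product with the (also normalized) video-side embedding is a cosine similarity.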

Training and Inference
During training, our model is jointly optimized by the ELBO and the video-text matching loss:

L(Ω) = −ELBO(σ; φ, θ) + α s_vid(V, σ), (15)

where α is a hyper-parameter balancing the two loss terms and Ω = (V, σ) is a video-sentence pair.
During inference, we predict the most likely tree t* for a sentence σ without accessing the video:

t* = argmax_{t∈T_G(σ)} p_θ(t | σ; μ_φ(σ)), (16)

where μ_φ(σ) is the mean vector of the variational posterior q_φ(z|σ), and t* can be obtained using the CYK algorithm (Cocke, 1969; Younger, 1967; Kasami, 1966).
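Argmax decoding with CYK can be sketched as follows (a minimal span-score version of our own: the real algorithm also tracks grammar categories, but the dynamic program over split points is the same):

```python
# Given a score for every candidate span (i, j), find the binary tree
# that maximizes the total score of its spans.
def cyk(score, n):
    best, back = {}, {}
    for i in range(n):
        best[(i, i + 1)] = 0.0                     # single words cost nothing
    for length in range(2, n + 1):                 # build longer spans bottom-up
        for i in range(n - length + 1):
            j = i + length
            k_best = max(range(i + 1, j),
                         key=lambda k: best[(i, k)] + best[(k, j)])
            best[(i, j)] = score[(i, j)] + best[(i, k_best)] + best[(k_best, j)]
            back[(i, j)] = k_best                  # best split point for (i, j)
    return best, back

# Toy spans over a 3-word sentence; higher score = more likely constituent.
scores = {(0, 2): 1.0, (1, 3): 2.0, (0, 3): 0.5}
best, back = cyk(scores, 3)                        # best tree splits (0,3) at 1
```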

Evaluation
Following the evaluation practice in Zhao and Titov (2020), we discard punctuation and ignore trivial single-word and sentence-level spans at test time.
The gold parse trees are obtained by applying a state-of-the-art constituency parser, Benepar (Kitaev and Klein, 2018), to the test set. All models are run 4 times for 10 epochs with different random seeds. We report averaged corpus-level F1 (C-F1) and averaged sentence-level F1 (S-F1) as well as their standard deviations.
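The two F1 variants can be sketched as follows (spans are (start, end) pairs with trivial single-word and sentence-level spans already removed; the toy data is ours):

```python
def f1(pred, gold):
    """Unlabeled F1 between two sets of spans."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

preds = [{(0, 2), (3, 5)}, {(1, 3)}]
golds = [{(0, 2), (2, 5)}, {(1, 3)}]

# Sentence-level F1: average the per-sentence F1 scores.
s_f1 = sum(f1(p, g) for p, g in zip(preds, golds)) / len(preds)

# Corpus-level F1: pool all spans across sentences, then compute one F1.
tp = sum(len(p & g) for p, g in zip(preds, golds))
prec = tp / sum(len(p) for p in preds)
rec = tp / sum(len(g) for g in golds)
c_f1 = 2 * prec * rec / (prec + rec)
```

The two can differ noticeably: S-F1 weights every sentence equally, while C-F1 weights sentences by how many spans they contribute.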

Expert Features
In order to capture the rich content of videos, we extract features from state-of-the-art models for different tasks, including object, action, scene, sound, face, speech, and optical character recognition (OCR). For object and action recognition, we explore multiple models with different architectures and pre-training datasets. Details are as follows: Object features are extracted by two models: ResNeXt-101 (Xie et al., 2017), pre-trained on Instagram hashtags (Mahajan et al., 2018) and fine-tuned on ImageNet (Krizhevsky et al., 2012), and SENet-154 (Hu et al., 2018), trained on ImageNet. These datasets include images of common objects, such as "cock", "kite" and "goose". We use the predicted logits as object features for both models, where the dimension is 1000. OCR features are extracted in two steps: characters are first recognized by combining the text detector Pixel Link (Deng et al., 2018) and the text recognizer SSFL. The characters are then converted to word embeddings through word2vec (Mikolov et al., 2013) as the final OCR features, where the feature dimension is 300.
Face features are extracted by combining the face detector SSD (Liu et al., 2016) and the face recognizer ResNet50 (He et al., 2016). The feature dimension is 512. Speech features are extracted in two steps: transcripts are first obtained via the Google Cloud Speech-to-Text API. The transcripts are then converted to word embeddings through word2vec (Mikolov et al., 2013) as the final speech features, where the dimension is 300.

Implementation Details
Due to computational limitations, we keep only sentences with fewer than 20 words in the training set. After filtering, the training sets cover 99.4%, 98.5% and 97.1% of the samples in the original splits of DiDeMo, YouCook2 and MSRVTT, respectively.
We train the baseline models, C-PCFG and VC-PCFG, with the same hyperparameters suggested in Kim et al. (2019) and Zhao and Titov (2020). Our MMC-PCFG is composed of a parsing model and a video-text matching model. The parsing model has the same parameters as VC-PCFG (please refer to their paper for details). For the video-text matching model, all extracted expert features are projected to 512-dimensional vectors. The transformer has 2 layers, a dropout probability of 10%, a hidden size of 512 and an intermediate size of 2048. We select the top-2000 most common words as the vocabulary for all datasets. All baseline methods and our models are optimized using Adam (Kingma and Ba, 2015) with the learning rate set to 0.001, β1 = 0.75 and β2 = 0.999. All parameters are initialized with the Xavier uniform initializer (Glorot and Bengio, 2010). The batch size is set to 16.
Due to the long video durations, it is infeasible to feed all features into the multi-modal transformer. Therefore, each feature sequence from the object, motion and scene categories is partitioned into 8 chunks and then average-pooled within each chunk. For features from other categories, global average pooling is applied. In this way, coarse-grained temporal information is preserved. Note that some videos have no audio, and some have no detected faces or text characters. We pad these missing features with zeros. All the aforementioned expert features are obtained from Albanie et al. (2020).
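The chunk pooling described above can be sketched as follows (feature values are illustrative):

```python
import numpy as np

def chunk_pool(features, n_chunks=8):
    """Split a (frames, dim) sequence into n_chunks along time and
    mean-pool each chunk, keeping coarse temporal order."""
    chunks = np.array_split(features, n_chunks, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])

feats = np.arange(32, dtype=float).reshape(16, 2)   # 16 frames, dim 2
pooled = chunk_pool(feats)                          # (8, 2): 2 frames per chunk
```

`np.array_split` also handles frame counts not divisible by 8, producing slightly uneven chunks instead of failing.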

Main Results
We evaluate the proposed MMC-PCFG approach on three datasets and compare it with recently proposed state-of-the-art methods, C-PCFG (Kim et al., 2019) and VC-PCFG (Zhao and Titov, 2020). The results are summarized in Table 1. The values highlighted in bold and italic fonts indicate the top-2 methods, respectively. All results are reported in percentage (%). LBranch, RBranch and Random represent left-branching trees, right-branching trees and random trees, respectively. Since VC-PCFG is originally designed for images, it is not directly comparable with our method. In order to allow VC-PCFG to accept videos as input, we first average the video features along the temporal dimension and then feed them into the model. We evaluate VC-PCFG with 10, 7, and 10 expert features for DiDeMo, YouCook2 and MSRVTT, respectively. In addition, we also include the concatenation of the averaged features (Concat). Since the object and action categories involve more than one expert, we directly use the experts' names instead of their categories in Table 1.
Overall performance comparison. We first compare the overall performance, i.e., C-F1 and S-F1, among all models, as shown in Table 1. The right-branching model serves as a strong baseline, since English is a largely right-branching language. C-PCFG learns parsing purely from text. Compared to C-PCFG, the better overall performance of VC-PCFG demonstrates the effectiveness of leveraging video information. Within VC-PCFG, concatenating all features together does not always outperform a model trained on a single expert (R2P1D vs. Concat on DiDeMo and MSRVTT). The reason is that each expert is learned independently, so their correlations are not considered. In contrast, our MMC-PCFG outperforms all baselines on C-F1 and S-F1 on all datasets. The superior performance indicates that our model can leverage the benefits of all the experts 2 . Moreover, the improvement over Concat demonstrates the importance of modeling relations among different experts and different timestamps.
Performance comparison among different phrase types. We compare the models' recall on the top-3 most frequent phrase types (NP, VP and PP). These three types cover 77.4%, 80.1% and 82.4% of the spans in the gold trees on DiDeMo, YouCook2 and MSRVTT, respectively. In the following, we compare performance on DiDeMo, as shown in Table 1. Comparing models trained with a single expert, we find that object features (ResNeXt and SENet) achieve the top-2 recalls on NPs, while action features (I3D, R2P1D and S3DG) achieve the top-3 recalls on VPs and PPs. This indicates that different experts help the parser learn syntactic structure from different aspects. Meanwhile, action features improve over C-PCFG 3 on VPs and PPs by a large margin, which once again verifies the benefits of using video information.
Comparing our MMC-PCFG with VC-PCFG, our model achieves top-2 recall with smaller variance on NP, VP and PP. This demonstrates that our model can take advantage of different experts and learn consistent grammars.

3 The low VP recall of C-PCFG on DiDeMo may be caused by it attaching high PPs to the rest of the sentence instead of to the rest of the verb phrase, which breaks the whole VP. For PPs, C-PCFG attaches prepositions to the preceding word, which may be caused by confusion between prepositions in PPs and phrasal verbs.

Ablation Study
In this section, we conduct several ablation studies on DiDeMo, shown in Figures 2-4. All results are reported in percentage (%).
Performance comparison over constituent length. We first examine model performance for constituents of different lengths in Figure 2. As constituents become longer, the recall of all models (except RBranch) decreases, as expected (Kim et al., 2019; Zhao and Titov, 2020). MMC-PCFG outperforms C-PCFG and VC-PCFG at all constituent lengths. We further illustrate the label distribution over constituent length in Figure 3. We find that approximately 98.1% of the constituents have fewer than 9 words, and most of them are NPs, VPs and PPs. This suggests that improvements on NPs, VPs and PPs can strongly affect overall performance.
Consistency between different models. Next, we analyze the consistency of these different models.
The consistency between two models is measured by averaging sentence-level F1 scores over all possible pairings of different runs 4 (Williams et al., 2018). We plot the consistency for each pair of models in Figure 4 and call it the consistency matrix.
Comparing the self-F1 of all models (the diagonal of the matrix), R2P1D has the highest score, suggesting that R2P1D is the most reliable feature for helping the parser converge to a specific grammar. Comparing the models trained with different single experts, ResNeXt vs. SENet reaches the highest non-self F1, since both are object features trained on ImageNet and have similar effects on the parser. We also find that the lowest non-self F1 comes from Audio vs. I3D, since they are extracted from different modalities (sound vs. vision). Compared with the other models, our model is most consistent with R2P1D, indicating that R2P1D contributes most to our final prediction.
Contribution of different modalities. We also evaluate how different modalities contribute to the performance of MMC-PCFG. We divide the current experts into three groups: video (object, action, scene and face), audio (audio) and text (OCR and ASR). Ablating one group at a time during training, we find that the model without video experts has the largest performance drop (see Table 2). Therefore, video contributes most to the performance among all modalities.

4 Different runs represent models trained with different seeds.

Qualitative Analysis
In Figure 5, we visualize parse trees predicted by the best runs of SENet154, I3D and MMC-PCFG. We observe that SENet identifies all NPs but fails on the VP. I3D correctly predicts the VP but fails to recognize an NP, "the man". Our MMC-PCFG takes advantage of all experts and produces the correct prediction.

Related Work
Grammar Induction. Grammar induction and unsupervised parsing have been long-standing problems in computational linguistics (Carroll and Charniak, 1992). Recent work utilizes neural networks to predict constituency structures with no supervision (Shen et al., 2018a; Drozdov et al., 2019; Shen et al., 2018b; Kim et al., 2019; Jin et al., 2019a) and shows promising results. In addition to learning purely from text, there is growing interest in using image information to improve the accuracy of induced constituency trees (Shi et al., 2019; Kojima et al., 2020; Zhao and Titov, 2020; Jin and Schuler, 2020). Different from previous work, our work improves the constituency parser by using videos, which contain richer information than images.

Video-Text Matching. Video-text matching has been widely studied in various tasks, such as video retrieval (Liu et al., 2019), moment localization with natural language (Zhang et al., 2019, 2020) and video question answering (Xu et al., 2017; Jin et al., 2019b). It aims to learn video-semantic representations in a joint embedding space. Recent work (Liu et al., 2019) focuses on learning multi-modal video representations to match with text. In this work, we borrow this idea to match video and textual representations.

Conclusion
In this work, we have presented a new task referred to as video-aided unsupervised grammar induction. This task aims to improve grammar induction models by using aligned video-sentence pairs, as an effective way to address the limitation of current image-based methods, where only object information from static images is considered and important verb-related information from vision is missing. Moreover, we present Multi-Modal Compound Probabilistic Context-Free Grammars (MMC-PCFG) to effectively integrate video features extracted from different modalities and induce more accurate grammars. Experiments on three datasets demonstrate the effectiveness of our method.