Visually Grounded Compound PCFGs

Exploiting visual groundings for language understanding has recently been drawing much attention. In this work, we study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual groundings. Existing work on this task (Shi et al., 2019) optimizes a parser via REINFORCE and derives the learning signal only from the alignment of images and sentences. While their model is relatively accurate overall, its error distribution is very uneven, with low performance on certain constituent types (e.g., 26.2% recall on verb phrases, VPs) and high performance on others (e.g., 79.6% recall on noun phrases, NPs). This is not surprising, as the learning signal is likely insufficient for deriving all aspects of phrase-structure syntax and gradient estimates are noisy. We show that, using an extension of probabilistic context-free grammars, we can perform fully differentiable end-to-end visually grounded learning. Additionally, this enables us to complement the image-text alignment loss with a language modeling objective. On the MSCOCO test captions, our model establishes a new state of the art, outperforming its non-grounded version and thus confirming the effectiveness of visual groundings in constituency grammar induction. It also substantially outperforms the previous grounded model, with the largest improvements on more 'abstract' categories (e.g., +55.1% recall on VPs).


Introduction
Grammar induction is the task of finding the latent hierarchical structure of language. As a fundamental problem in computational linguistics, it has been extensively studied for decades (Lari and Young, 1990; Carroll and Charniak, 1992; Clark, 2001; Klein and Manning, 2002). Recently, deep learning models have been shown to be very effective across NLP tasks and have also been applied to grammar induction, greatly advancing the area (Shen et al., 2018, 2019; Kim et al., 2019a,b; Jin et al., 2019). These neural grammar-induction approaches have generally relied on text alone, without considering learning signals from other modalities.
In contrast, a crucial aspect of natural language learning is that it is grounded in perceptual experiences (Barsalou, 1999; Fincher-Kiefer, 2001; Bisk et al., 2020). We thus anticipate improved language understanding from leveraging grounded learning. Promising results from grounded learning have been emerging in areas such as representation learning (Bruni et al., 2014; Kiela et al., 2018; Bordes et al., 2019). Typically, these works use visual images as perceptual groundings of language and aim at improving continuous vector representations of language (e.g., word or sentence embeddings). In this work, we consider a more challenging problem: can visual groundings help us induce syntactic structure? We refer to this problem as visually grounded grammar induction. Shi et al. (2019) propose a visually grounded neural syntax learner (VG-NSL) to tackle the task. Specifically, they learn a parser from aligned image-sentence pairs (e.g., image-caption data), where each sentence describes the visual content of the corresponding image. The parser is optimized via REINFORCE, where the reward is computed by scoring the alignment of images and constituents. While straightforward, matching-based rewards can, as we will discuss further in the paper, make the parser focus only on more local and short constituents (e.g., 79.6% recall on NPs) and perform poorly on longer ones (e.g., 26.2% recall on VPs) (Shi et al., 2019). On the former it outperforms text-only grammar induction methods; on the latter it substantially underperforms. This may not be surprising, as it is not guaranteed that every constituent of a sentence has a visual representation in the aligned image; the reward signals can be noisy and insufficient to capture all aspects of phrase-structure syntax. Consequently, Shi et al. (2019) have to rely on a language-specific inductive bias to obtain more informative reward signals.
Another issue with VG-NSL is that the parser does not admit tractable estimation of the partition function and the posterior probabilities for constituent boundaries needed to compute the expected reward in closed form. Instead, VG-NSL relies on Monte Carlo policy gradients, potentially suffering from high variance.
To alleviate the first issue, we propose to complement the image-text alignment-based loss with a loss defined on unlabeled text (i.e., its log-likelihood). As re-confirmed with neural models by Shen et al. (2019) and Kim et al. (2019a), text itself can drive the induction of rich syntactic knowledge, so additionally optimizing the parser on raw text can be beneficial and complementary to visually grounded learning. To resolve the second issue, we resort to an extension of the probabilistic context-free grammar (PCFG) model, the compound PCFG (Kim et al., 2019a). It admits tractable estimation of the posteriors needed in the alignment loss via dynamic programming and leads to fully differentiable end-to-end visually grounded learning. More importantly, the PCFG parser lets us complement the alignment loss with a language modeling objective.
Our key contributions can be summarized as follows: (1) we propose a fully differentiable end-to-end visually grounded learning framework for grammar induction; (2) we additionally optimize a language modeling objective to complement visually grounded learning; (3) we conduct experiments on MSCOCO (Lin et al., 2014) and observe that our model achieves higher recall than VG-NSL for five out of the six most frequent constituent labels. For example, it surpasses VG-NSL by 55.1% recall on VPs and by 48.7% recall on prepositional phrases (PPs). Compared to a model trained purely via visually grounded learning, extending the loss with a language modeling objective improves the overall F1 from 50.5% to 59.4%.

Background and Motivation
Our model relies on compound PCFGs (Kim et al., 2019a) and generalizes the visually grounded grammar learning framework of Shi et al. (2019). We describe the relevant aspects of both frameworks in Sections 2.1-2.2, and then discuss their limitations (Section 2.3).

Compound PCFGs
Compound PCFGs extend context-free grammars (CFGs) and, to establish notation, we start by briefly introducing them. A CFG is defined as a 5-tuple G = (S, N, P, Σ, R), where S is the start symbol, N is a finite set of nonterminals, P is a finite set of preterminals, Σ is a finite set of terminals, and R is a set of production rules in Chomsky normal form:

S → A,  A ∈ N
A → B C,  A ∈ N, B, C ∈ N ∪ P
T → w,  T ∈ P, w ∈ Σ

PCFGs extend CFGs by associating each production rule r ∈ R with a non-negative scalar π_r such that Σ_{r: A→γ} π_r = 1, i.e., the probabilities of production rules with the same left-hand-side symbol sum to 1. The strong context-free assumption hinders PCFGs and prevents them from being effective in the grammar induction context. Compound PCFGs (C-PCFGs) mitigate this issue by assuming that rule probabilities follow a compound probability distribution (Robbins, 1951):

π_r = g_r(z; θ),  z ∼ p(z),

where p(z) is a prior distribution over the latent z, and g_r(·; θ) is parameterized by θ and yields a rule probability π_r. Depending on the rule type, g_r(·; θ) takes one of these forms:

π_{S→A} ∝ exp(u_A^⊤ f_s([w_S; z])),
π_{A→BC} ∝ exp(u_{BC}^⊤ [w_A; z]),
π_{T→w} ∝ exp(u_w^⊤ f_t([w_T; z])),

where u is a parameter vector, w_N is a symbol embedding for N ∈ {S} ∪ N ∪ P, [·; ·] indicates vector concatenation, and f_s(·) and f_t(·) encode the input into a vector (parameters are dropped for simplicity).
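As a concrete illustration, the compound parameterization can be sketched as follows. All dimensions, names, and the use of plain linear scorers (we drop the MLPs f_s and f_t for brevity) are our own toy assumptions, not the architecture of Kim et al. (2019a):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy dimensions (our own assumptions, not the paper's).
rng = np.random.default_rng(0)
n_nt, n_pt, vocab, d_sym, d_z = 3, 4, 10, 8, 5

w_S = rng.normal(size=d_sym)                 # start-symbol embedding
w_nt = rng.normal(size=(n_nt, d_sym))        # nonterminal embeddings
w_pt = rng.normal(size=(n_pt, d_sym))        # preterminal embeddings
U_root = rng.normal(size=(n_nt, d_sym + d_z))               # scorer for S -> A
U_bin = rng.normal(size=((n_nt + n_pt) ** 2, d_sym + d_z))  # scorer for A -> B C
U_term = rng.normal(size=(vocab, d_sym + d_z))              # scorer for T -> w

def rule_probs(z):
    """Map one sampled latent z to the rule probabilities of one PCFG."""
    pi_root = softmax(U_root @ np.concatenate([w_S, z]))  # distribution over A
    # one distribution over (B, C) pairs per nonterminal A
    pi_bin = np.stack([softmax(U_bin @ np.concatenate([wA, z])) for wA in w_nt])
    # one distribution over the vocabulary per preterminal T
    pi_term = np.stack([softmax(U_term @ np.concatenate([wT, z])) for wT in w_pt])
    return pi_root, pi_bin, pi_term

z = rng.normal(size=d_z)                     # z ~ p(z), e.g. a standard Gaussian
pi_root, pi_bin, pi_term = rule_probs(z)
```

Sampling a different z yields a different, fully normalized PCFG, which is exactly the mixture view described next.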
A C-PCFG defines a mixture of PCFGs (i.e., we can sample a set of PCFG parameters by sampling a vector z). It satisfies the context-free assumption conditioned on z and thus admits exact inference for each given z. Learning with C-PCFGs involves maximizing the log-likelihood of every observed sentence w = w_1 w_2 … w_n:

log p_θ(w) = log ∫ Σ_{t ∈ T_G(w)} p_θ(t | z) p(z) dz,

where T_G(w) consists of all parses of the sentence w under a PCFG G. Though for each given z the inner summation over parses can be efficiently computed using the inside algorithm (Baker, 1979), the integral over z makes optimization intractable. Instead, C-PCFGs rely on variational inference and maximize the evidence lower bound (ELBO):

ELBO(w; φ, θ) = E_{q_φ(z|w)}[log p_θ(w | z)] − KL[q_φ(z|w) ‖ p(z)],

where q_φ(z|w) is a variational posterior, a neural network parameterized with φ. The expected log-likelihood term is estimated via the reparameterization trick (Kingma et al., 2014); the KL term can be computed analytically when p(z) and q_φ(z|w) are normally distributed.
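For a fixed z, the inner summation over parses is the standard inside recursion. A minimal NumPy sketch for a grammar in Chomsky normal form (for brevity we conflate nonterminals and preterminals into one symbol set; all names and shapes are our own assumptions, not the paper's implementation):

```python
import numpy as np

def inside(binary, emit, sent):
    """Inside algorithm for a CNF grammar with z fixed, i.e. an ordinary PCFG.
    binary[a, b, c] = pi(a -> b c); emit[a, w] = pi(a -> w); symbol 0 is the
    start symbol. Returns p(sent) summed over all parse trees."""
    n, K = len(sent), emit.shape[0]
    beta = np.zeros((n, n, K))        # beta[i, j, a]: inside score of span (i, j)
    for i, w in enumerate(sent):
        beta[i, i] = emit[:, w]       # width-1 spans: emission probabilities
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            for k in range(i, j):     # split point
                # sum_{b,c} pi(a -> b c) * beta[i, k, b] * beta[k+1, j, c]
                beta[i, j] += np.einsum('abc,b,c->a', binary,
                                        beta[i, k], beta[k + 1, j])
    return beta[0, n - 1, 0]          # start symbol spanning the whole sentence
```

In the model itself this chart is computed in log space inside the expectation over q_φ(z|w); differentiating through it also yields the span marginals used later.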

Visually grounded neural syntax learner
The visually grounded neural syntax learner (VG-NSL) comprises a parsing model and an image-text matching model. The parsing model is an easy-first parser (Goldberg and Elhadad, 2010). It builds a parse greedily in a bottom-up manner while at the same time producing a semantic representation for each constituent in the parse (i.e., its 'embedding'). The parser is optimized through REINFORCE (Williams, 1992). The reward encourages merging two adjacent constituents if the merge results in a constituent that is concrete, i.e., if its semantic representation is predictive of the corresponding image, as measured with a matching function. We omit details of the parser and of how the semantic representations of constituents are computed, as they are not relevant to our approach, and refer the reader to Shi et al. (2019). However, as we will extend their image-text matching model, we explain this component of their approach more formally. In their work, this loss is used to learn the textual and visual representations. For every constituent c^(i) of a sentence w^(i), they define the following triplet hinge loss:

h(c^(i), v^(i)) = E_{c′,v′}[ max(0, m(c^(i), v′) − m(c^(i), v^(i)) + ε) + max(0, m(c′, v^(i)) − m(c^(i), v^(i)) + ε) ],   (2)

where m(·, ·) is the matching function measuring similarity between the constituent representation c and the image representation v, and ε is a margin. The expectation is taken with respect to 'negative examples' c′ and v′. In practice, for efficiency reasons, a single representation of an image v′ and a single representation of a constituent (span) c′ from another example in the same batch are used as the negative examples.
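One instance of this triplet loss can be sketched as follows; using cosine similarity for the matching function m and the value of the margin ε are assumptions of this sketch, not details taken from VG-NSL:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hinge_loss(c, v, c_neg, v_neg, eps=0.2):
    """Triplet hinge loss for a constituent representation c aligned with an
    image representation v, with single in-batch negatives c_neg and v_neg.
    Matching function m (cosine) and margin eps are our own assumptions."""
    m = cosine
    return (max(0.0, m(c, v_neg) - m(c, v) + eps)
            + max(0.0, m(c_neg, v) - m(c, v) + eps))

# A perfectly aligned pair with orthogonal negatives incurs no loss:
c = v = np.array([1.0, 0.0])
c_neg = v_neg = np.array([0.0, 1.0])
```

Swapping the aligned image and the negative image makes both hinge terms active and the loss positive, which is the ranking behavior the loss is meant to enforce.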
Intuitively, an aligned image-constituent pair (c^(i), v^(i)) should score higher than an unaligned one. The total loss for an image-sentence pair (v^(i), w^(i)) is obtained by summing losses over all constituents in a tree t^(i) sampled from the parsing model (we write c^(i) ∈ t^(i)):

s(v^(i), w^(i)) = Σ_{c^(i) ∈ t^(i)} h(c^(i), v^(i)).   (3)

In their work, training alternates between optimizing the parser using rewards (relying on image and text representations) and optimizing the image-text matching model to refine image and text representations (relying on the fixed parsing model). Once trained, the parser can be applied directly to raw text, i.e., images are not used at test time.

Limitations of the VG-NSL framework
While the framework is straightforward, several practical issues inhibit visually grounded learning. First, contrastive learning implicitly assumes that every constituent of a sentence has a visual representation in the aligned image. This is not guaranteed in practice and results in noisy reward signals. Besides, the loss in Equation 2 (and a similar component in the reward, see Shi et al. (2019)) focuses on constituents corresponding to short spans. Long spans, independently of their syntactic structure, tend to be sufficiently discriminative to distinguish the aligned image v^(i) from an unaligned one. This implies that there is not much learning signal for such constituents. The tendency to focus on short spans and those more easily derivable from an image is evident from the results (Shi et al., 2019; Kojima et al., 2020). For example, their parser is accurate for noun phrases (recall 79.6%), which are often short in captions, but performs poorly on verb phrases (recall 26.2%), which have longer spans, are compositionally more complex, and are harder to predict from images (see our analysis in Section 4.3.2). While there may be ways to mitigate some of these issues, we believe that any image-text matching loss alone is unlikely to provide sufficient learning signal to accurately capture all aspects of syntax. Instead of resorting to language-specific inductive biases, as done by Shi et al. (2019) (i.e., the head-initial bias (Baker, 2008) of English), we propose to complement the image-text matching loss with an objective derived from the unaligned text (i.e., log-likelihood), jointly training a parser both to explain the raw language data and to align with images.
Moreover, their learning is likely to suffer from large variance in gradient estimation, as their parser does not admit tractable estimation of the partition function, forcing them to rely on sampled decisions. This would be even more of a problem if we attempted to use it in the joint learning setup. Also note that similar parsing models do not yield linguistically plausible structures when used in conventional (i.e., non-grounded) grammar-induction setups (Williams et al., 2018; Havrylov et al., 2019).
In the next section, we will use compound PCFGs and describe an improved visually grounded learning framework that can tackle these issues neatly.

Visually grounded compound PCFGs
We use compound PCFGs (Kim et al., 2019a) and develop visually-grounded compound PCFGs (VC-PCFGs) within the contrastive learning framework. Instead of sampling a tree and computing a point estimate of the image-text matching loss, we can compute the expected image-text matching loss under a tree distribution and use end-to-end contrastive learning (Section 3.1). Since it is inefficient to compute constituent representations relying on the chart, we will introduce an additional textual representation model to encode constituents (Section 3.2). Moreover, VC-PCFGs let us additionally optimize a language modeling objective, complementing the visually grounded contrastive learning (Section 3.3).

End-to-end contrastive learning
In the visually grounded grammar induction framework, the parsing model is optimized through learning signals derived from the alignment of images and constituents, as scored by the image-text matching model. Denoting a set of image representations by V = {v^(i)} and the corresponding set of sentences by W = {w^(i)}, the image-text matching model is optimized via contrastive learning:

L(V, W) = Σ_i s(v^(i), w^(i)),   (4)

where we define s(v^(i), w^(i)) as the loss of aligning v^(i) and w^(i). In VG-NSL, it is computed from a single sampled tree, i.e., as a point estimate (see Equation 3). In VC-PCFGs, given an aligned image-sentence pair (v, w), we instead compute the expected image-sentence matching loss under the tree distribution p_θ(t|w), leading to end-to-end contrastive learning:

s(v, w) = E_{p_θ(t|w)}[ Σ_{c ∈ t} h(c, v) ],   (5)

where h(c, v) is the hinge loss of aligning the unlabeled constituent c and the image v (defined in Equation 2). Minimizing the hinge loss encourages an aligned image-constituent pair to rank higher than any unaligned one. Expanding the right-hand side of Equation 5 gives

s(v, w) = Σ_c p(c|w) h(c, v),

where p(c|w) is the conditional probability (i.e., marginal) of the span c given w. It can be efficiently computed with the inside algorithm and automatic differentiation (Eisner, 2016).
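The expansion rests on a simple identity: a tree is a set of spans, so the expectation of a sum over spans equals a marginal-weighted sum over spans. A toy numerical check for a 3-word sentence (the tree probabilities and per-span losses below are made up for illustration):

```python
# Two binary trees over a 3-word sentence w1 w2 w3 (non-trivial spans only):
#   t1 = ((w1 w2) w3): spans (0, 1) and (0, 2)
#   t2 = (w1 (w2 w3)): spans (1, 2) and (0, 2)
p_t1, p_t2 = 0.3, 0.7                        # toy tree distribution p(t | w)
h = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 0.5}  # toy hinge losses h(c, v) per span

# Left-hand side: E_{p(t|w)} [ sum_{c in t} h(c, v) ]
lhs = p_t1 * (h[(0, 1)] + h[(0, 2)]) + p_t2 * (h[(1, 2)] + h[(0, 2)])

# Right-hand side: sum_c p(c | w) h(c, v), with span marginals p(c | w)
marginals = {(0, 1): p_t1, (1, 2): p_t2, (0, 2): p_t1 + p_t2}
rhs = sum(marginals[c] * h[c] for c in h)

assert abs(lhs - rhs) < 1e-9
```

In the model the marginals are not enumerated this way, of course; they come from the inside algorithm and automatic differentiation, as noted above.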

Span representation
Estimation of the expected image-text matching scores relies on span representations. Ideally, a span representation should encode the semantics of a span, with its computation guided by the span's syntactic structure (Socher et al., 2013). Reliance on the predicted tree structure would propagate learning signals derived from the alignment of images and sentences back to the parser. To realize this desideratum, we could follow the inside algorithm and recursively compose span representations (Le and Zuidema, 2015; Stern et al., 2017; Drozdov et al., 2019), which is, however, time- and memory-inefficient in practice. Instead, we produce span representations largely independently of the parser, as we explain below. The only way the parser influences this representation is through the predicted constituent label: we use its distribution to compute the representation. Specifically, as a trade-off for better training efficiency, we adopt a single-layer BiLSTM to encode spans. A mean-pooling layer is applied over the hidden states h of the BiLSTM, followed by a label-specific affine transformation f_k(·) to produce a label-specific span representation c_k. For a span c_{i,j} = w_i … w_j (0 < i < j ≤ n):

c_k = f_k( (1 / (j − i + 1)) Σ_{l=i}^{j} h_l ).

The BiLSTM encoding model operates at the span level and encodes the semantics of a span. Unlike a single sentence-level (Bi)LSTM encoder, it guarantees that no information from words outside of the span leaks into its representation. More importantly, it can run in O(n) for a sentence of length n with a parallel implementation. While the produced representation does not reflect the structural decisions made by the parser, it can be sensitive to word order and may be affected by the span's syntactic structure (Blevins et al., 2018).
In order to compute the representation of the unlabeled constituent c, we average the label-specific span representations c_k under the distribution over labels defined by the parser:

c = Σ_k p(k|c, w) c_k,

where p(k|c, w) is the probability that the span c has label k, conditioned on having this constituent span in the tree.
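Putting the two steps together, the span representation can be sketched as follows. The BiLSTM is replaced by precomputed hidden states, and all shapes, names, and toy values are our own assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_h, d_c, K = 6, 8, 4, 3            # sentence length, hidden/output dims, #labels
h = rng.normal(size=(n, d_h))          # BiLSTM hidden states (assumed precomputed)
W = rng.normal(size=(K, d_c, d_h))     # per-label affine weights of f_k
b = rng.normal(size=(K, d_c))          # per-label affine biases of f_k

def span_rep(i, j, label_post):
    """Representation of the unlabeled span (i, j): mean-pool hidden states,
    apply each label-specific affine f_k, then average the label-specific
    vectors under the parser's label posterior p(k | c, w)."""
    pooled = h[i:j + 1].mean(axis=0)   # mean-pooling over the span
    c_k = W @ pooled + b               # label-specific reps c_k, shape (K, d_c)
    return label_post @ c_k            # mixture under p(k | c, w)

c = span_rep(1, 3, np.array([0.5, 0.3, 0.2]))   # toy posterior p(k | c, w)
```

With a one-hot label posterior the mixture collapses to a single label-specific representation, which makes the averaging step easy to sanity-check.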
To further reduce computation, we estimate the matching loss using only a subset of the n(n−1)/2 candidate spans. The alignment signal is concentrated in short spans anyway (see discussion in Section 2.3), so we expect that this simplification would not hurt model performance significantly.

Joint objective
Rather than optimizing the contrastive learning objective alone, we additionally maximize the log-likelihood of the text data. As with C-PCFGs, we optimize the ELBO:

ELBO(w; φ, θ) = E_{q_φ(z|w)}[log p_θ(w | z)] − KL[q_φ(z|w) ‖ p(z)].

This learning objective complements contrastive learning. As contrastive learning optimizes a parser solely by matching images and constituents, on its own it would make the parser focus only on simple and local constituents (e.g., short NPs). Moreover, in practice, since not every constituent can be grounded in an image, contrastive learning alone would suffer from misleading or ambiguous learning signals.
To summarize, the overall loss function is

L(φ, θ) = −Σ_i ELBO(w^(i); φ, θ) + α Σ_i s(v^(i), w^(i)),

where α is a hyper-parameter balancing the relative importance of the contrastive learning objective.

Parsing
After training, the parser can be used directly to parse raw text, without requiring access to visual groundings. Parsing seeks the most probable parse t* of w:

t* = argmax_t ∫ p_θ(t | w, z) p_θ(z | w) dz.

Though for a fixed z the maximum a posteriori (MAP) inference over p_θ(t|w, z) can be solved by the CYK algorithm (Kasami, 1966; Younger, 1967), inference becomes intractable when the latent z is introduced. The MAP inference is instead approximated by

t* ≈ argmax_t ∫ p_θ(t | w, z) δ(z − μ_φ(w)) dz,

where δ(·) is the Dirac delta function and μ_φ(w) is the mean vector of the variational posterior q_φ(z|w). As δ(·) has zero mass everywhere but at the mode μ_φ(w), this is equivalent to solving argmax_t p_θ(t | w, μ_φ(w)).
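Once z is fixed at μ_φ(w), the grammar is an ordinary PCFG, and the argmax parse follows from the standard CYK/Viterbi recursion. A minimal sketch over a toy CNF grammar with binary[a, b, c] = π(a → b c) and emit[a, w] = π(a → w), symbol 0 the start symbol, nonterminals and preterminals conflated for brevity; names and shapes are our own:

```python
import numpy as np

def cyk(binary, emit, sent):
    """Viterbi/CYK for a CNF PCFG (z already fixed, so rule probabilities are
    plain numbers). Returns the log-score of the best parse and backpointers
    (split, left-child, right-child) keyed by (i, j, parent)."""
    n, K = len(sent), emit.shape[0]
    score = np.full((n, n, K), -np.inf)
    back = {}
    for i, w in enumerate(sent):
        score[i, i] = np.log(emit[:, w] + 1e-30)
    log_bin = np.log(binary + 1e-30)
    for width in range(1, n):
        for i in range(n - width):
            j = i + width
            for k in range(i, j):                       # split point
                # score of parent a via children (b, c) at split k
                s = (log_bin + score[i, k][None, :, None]
                     + score[k + 1, j][None, None, :])
                flat = s.reshape(K, -1)
                best = flat.max(axis=1)
                for a in np.nonzero(best > score[i, j])[0]:
                    score[i, j, a] = best[a]
                    back[(i, j, a)] = (k, *divmod(int(flat[a].argmax()), K))
    return score[0, n - 1, 0], back
```

Following the backpointers recursively from (0, n−1, 0) recovers the predicted tree.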

Datasets and evaluation
Datasets: We use MSCOCO (Lin et al., 2014). It consists of 82,783 training images, 1,000 validation images, and 1,000 test images. Each image is associated with 5 caption sentences. We encode images into 2048-dimensional vectors using the pre-trained ResNet-101 (He et al., 2016). At test time, only captions are used. We follow Shi et al. (2019) and parse test captions with Benepar (Kitaev and Klein, 2018). We use the same data preprocessing as in Shen et al. (2019) and Kim et al. (2019a): punctuation is removed from all data, and the top 10,000 most frequent words in the training sentences are kept as the vocabulary.

Evaluation: We mainly compare VC-PCFGs with VG-NSL (Shi et al., 2019). To verify the effectiveness of visual groundings, we also compare our model with a C-PCFG trained only on the training captions. All models are run four times with different random seeds and for at most 15 epochs, with early stopping (i.e., training stops when the image-caption loss / perplexity on the validation captions does not decrease). We report averaged corpus-level F1 and averaged sentence-level F1 numbers as well as unbiased standard deviations.

Settings and hyperparameters
We adopt the parameter settings suggested by the authors for the baseline models. For VG-NSL we run the authors' code. We re-implement C-PCFG using automatic differentiation (Eisner, 2016) to speed up training. Our VC-PCFG comprises a parsing model and an image-text matching model. The parsing model has the same parameters as the baseline C-PCFG; the image-text matching model has the same parameters as the baseline VG-NSL. Concretely, the parsing model has 30 nonterminals and 60 preterminals. Each of them is represented by a 256-dimensional vector. The inference model q_φ(z|w) uses a single-layer BiLSTM. It has a 512-dimensional hidden state and relies on 512-dimensional word embeddings. We apply a max-pooling layer over the hidden states of the BiLSTM and then obtain 64-dimensional mean vectors μ_φ(w) and log-variances log σ_φ(w) using an affine layer.
The image-text matching model projects visual features into 512-dimensional feature vectors and encodes spans as 512-dimensional vectors. Our span representation model is another single-layer BiLSTM, with the same hyperparameters as in the inference model. The weight α of visually grounded learning is set to 0.001. We implement VC-PCFG relying on Torch-Struct (Rush, 2020) and optimize it using Adam (Kingma and Ba, 2015) with the learning rate set to 0.01, β_1 = 0.75, and β_2 = 0.999. All parameters are initialized with the Xavier uniform initializer (Glorot and Bengio, 2010).

Main results
Our model outperforms all baselines according to both corpus-level F1 and sentence-level F1 (see Table 1). Notably, it surpasses VG-NSL+HI by 10% F1. The right-branching model is a strong baseline on image captions, as observed previously on the WSJ corpus, including in recent work (Shen et al., 2018; Kim et al., 2019a). Compared with C-PCFG, which is trained solely on captions, VC-PCFG achieves a much higher mean F1 (+5.7% F1), demonstrating the informativeness of visual groundings. However, VC-PCFG suffers from a larger variance, presumably because the joint objective is harder to optimize. Visually grounded contrastive learning (w/o LM) has a mean F1 of 50.5%. It is further improved to 59.4% when additionally optimizing the language modeling objective. Moreover, we show recall on six frequent constituent labels (NP, VP, PP, SBAR, ADJP, ADVP) in the test captions. Unsurprisingly, VG-NSL is best on NPs because the matching-based reward signals optimize it to focus only on short and concrete NPs (recall 64.3%). It performs poorly on other constituent labels such as VPs (recall 28.1%). In contrast, VC-PCFG exhibits relatively even performance across constituent labels, e.g., it is most accurate on SBARs and ADVPs and works fairly well on VPs (recall 83.2%). Meanwhile, it improves over C-PCFG for NPs, which are usually short and 'concrete', once again confirming the benefits of using visual groundings. Visually grounded contrastive learning (w/o LM) tends to behave like the right-branching baseline. Additionally optimizing the language modeling objective brings a huge improvement for NPs (+19.3% recall).

Table 1: Recall on six frequent constituent labels (NP, VP, PP, SBAR, ADJP, ADVP) in the MSCOCO test captions, and corpus-level F1 (C-F1) and sentence-level F1 (S-F1) results. The best mean number in each column is in bold. † indicates results reported by Shi et al. (2019); ‡ denotes results obtained by running their code. Notice that the results from Shi et al. (2019) are not comparable to ours because they keep punctuation and include trivial sentence-level spans in evaluation.

Analysis
We analyze model performance for constituents of different lengths (Figure 1). As expected, VG-NSL becomes weaker as constituent length increases, and the drop is very dramatic. C-PCFG and its grounded version VC-PCFG consistently outperform VG-NSL on constituents longer than four tokens and display more even performance across constituent lengths. Meanwhile, VC-PCFG beats C-PCFG on constituents of length below 5, confirming that visual groundings are beneficial for short spans. We further plot the distribution over constituent length for different phrase types (Figure 2) and find that around 75% of constituents in our dataset are shorter than six tokens, and 60% of them are NPs. Thus, it is not surprising that the improvement on NPs, brought by visually grounded learning, has a large impact on the overall performance.

Next, we analyze the induced tree structures. We compare model predictions against gold trees, left-branching trees, and right-branching trees. As there is little performance difference between corpus-level F1 and sentence-level F1, we focus on sentence-level F1 in this analysis. We report self F1 (Williams et al., 2018) to show model consistency across runs. The self F1 is computed by averaging over six model pairs from four different runs. All results are presented in Table 2. Overall, all models have self F1 above 70%, indicating relatively high consistency. We observe that using the head-initial bias pushes VG-NSL closer to the right-branching baseline. In Figure 3, we visualize a parse tree predicted by the best run of VC-PCFG. We can see that VC-PCFG identifies most NPs but makes mistakes in PP attachment and consequently fails to identify the VP.

Related work
Grammar Induction has a long history in computational linguistics. Following observations that direct optimization of the log-likelihood with the Expectation Maximization algorithm (Lari and Young, 1990) is not effective at producing good grammars, a number of approaches have been developed, embodying various inductive biases or assumptions about language structure and its relation to surface realizations (Klein and Manning, 2002; Smith and Eisner, 2005; Cohen and Smith, 2009; Spitkovsky et al., 2010). Recent advances in the area have been brought by flexible neural models (Jin et al., 2019; Kim et al., 2019a,b; Drozdov et al., 2019). All these methods, with the exception of Shi et al. (2019), rely solely on text.
Visually grounded learning is motivated by the observation that natural language is grounded in perceptual experiences (Steels, 1998; Barsalou, 1999; Fincher-Kiefer, 2001; Roy, 2002; Bisk et al., 2020). It has been shown effective in word representation learning (Bruni et al., 2014; Silberer and Lapata, 2014; Lazaridou et al., 2015) and sentence representation learning (Kiela et al., 2018; Bordes et al., 2019). All this work uses visual images as perceptual experience of language and exploits visual semantics derived from images to improve continuous vector representations of language. In contrast, we use visual groundings to induce structured representations, namely the discrete tree structure of language. We propose a model for the task within the contrastive learning framework. Learning involves estimating the concreteness of spans, which generalizes word-level concreteness (Turney et al., 2011; Kiela et al., 2014).
In the vision and machine learning community, unsupervised induction of structured image representations (aka scene graphs or world models) has been receiving increasing attention (Eslami et al., 2016;Burgess et al., 2019;Kipf et al., 2020). However, they typically rely solely on visual signal. An interesting extension of our work would be to consider joint induction of structured representations of images and text while guiding learning by an alignment loss.

Conclusion
We have presented visually grounded compound PCFGs (VC-PCFGs), which build on compound PCFGs and generalize the visually grounded grammar learning framework. VC-PCFGs exploit visual groundings via contrastive learning, with learning signals derived from minimizing an image-text alignment loss. To tackle the problem of misleading and insufficient learning signals from purely alignment-based learning, we complement the image-text alignment loss with a loss defined on unlabeled text. Using compound PCFGs enables us to pair the alignment loss with a language modeling objective, resulting in fully differentiable end-to-end visually grounded learning. We empirically show that our VC-PCFGs are superior to models trained only through visually grounded learning or only on text.