Phrase Grounding by Soft-Label Chain Conditional Random Field

The phrase grounding task aims to ground each entity mention in a given caption of an image to a corresponding region in that image. Although there are clear dependencies between how different mentions of the same caption should be grounded, previous structured prediction methods that aim to capture such dependencies need to resort to approximate inference or non-differentiable losses. In this paper, we formulate phrase grounding as a sequence labeling task where we treat candidate regions as potential labels, and use neural chain Conditional Random Fields (CRFs) to model dependencies among regions for adjacent mentions. In contrast to standard sequence labeling tasks, the phrase grounding task is defined such that there may be multiple correct candidate regions. To address this multiplicity of gold labels, we define so-called Soft-Label Chain CRFs, and present an algorithm that enables convenient end-to-end training. Our method establishes a new state-of-the-art on phrase grounding on the Flickr30k Entities dataset. Analysis shows that our model benefits both from the entity dependencies captured by the CRF and from the soft-label training regime. Our code is available at github.com/liujch1998/SoftLabelCCRF.


Introduction
Given an image and a corresponding caption, the phrase grounding task aims to ground each entity mentioned by a noun phrase in the caption to a region in the image. Phrase grounding has attracted much research interest due to its applications in downstream tasks including image captioning (Karpathy et al., 2014; Fang et al., 2015; Donahue et al., 2017; Xu et al., 2015), image retrieval (Radenovic et al., 2016), and visual question answering (Agrawal et al., 2017; Yu et al., 2017, 2018a).

Figure 1: Example image-caption pairs from Flickr30k Entities, illustrating entity dependencies and gold label multiplicity. (a) Dependency between entities: the visual relationship between the grounding regions for "cheerleaders" and "a girl" should agree with the context "toss ... high up into the air". (b) Gold label multiplicity: the green box is the annotated gold grounding region for the entity phrase "Old man", while the orange dashed boxes are region proposals with IoU ≥ 0.5 with the gold region.
Phrase grounding systems typically work by ranking a set of candidate regions. Region proposals are generated from the image by a vision backbone model, without conditioning on the caption. Features of the phrase to be grounded are extracted and subsequently interact with features of the candidate regions to determine phrase-region compatibility. Candidate regions are then ranked by this compatibility score, and the highest-scored candidate region is selected as the predicted grounding of the phrase.
In Flickr30k Entities, each caption contains an average of 2.76 entity phrases to ground (Figure 2a; phrases with no corresponding gold regions are not counted). It therefore stands to reason that phrases in the same caption should not be grounded independently (to optimize each individual phrase-region assignment), but jointly (to optimize the global phrase-region assignment for the entire caption). Figure 1a illustrates this phenomenon. The caption contains a sequence of two entity phrases, "cheerleaders" and "a girl", and the task is to label each phrase with a candidate region that best grounds it. Since there are several women present in the image, the grounding of "a girl" is ambiguous by itself, but it can be disambiguated by encouraging the visual relationship between "a girl" and "cheerleaders" to conform with the context provided in the caption.
Previous work has recognized that dependencies between entities in the same caption play an important role in building more accurate phrase grounding systems (Wang et al., 2016, inter alia). The success of these structured prediction methods shows the advantage of considering entity dependencies in learning and prediction. However, these approaches capture only certain relations in an ad hoc manner, and resort to approximate inference (Wang et al., 2016) or non-differentiable losses. To obtain models and inference algorithms that facilitate more globally consistent phrase grounding predictions, we propose to formulate phrase grounding as a sequence labeling task where we treat candidate regions as potential labels for the phrases in the input sequence. This allows us to build phrase grounding models based on Conditional Random Fields (CRFs) (Lafferty et al., 2001) that capture entity dependencies in a general and differentiable manner. Our results indicate that systems that capture dependencies between phrases in the same caption in a principled manner outperform systems that ignore these dependencies.
A second problem lies in the use of region proposals, which distinguishes phrase grounding from other sequence labeling tasks where CRFs are directly applicable. Following the metrics of object detection, in phrase grounding the correctness of a predicted region is judged by its Intersection-over-Union (IoU) overlap with the gold region. To cover potential regions with high enough IoU, it is common to generate a myriad of region proposals, and these candidate regions often contain or substantially overlap with each other. As a result, there may be more than one candidate region with high IoU with the gold region, and they should all be considered correct groundings for the phrase. This phenomenon of gold label multiplicity is illustrated in Figure 1b. We hypothesize that it is important to account for gold label multiplicity and identify all correct region proposals during training, since the model would receive contradictory training signals if some correct proposals were marked as incorrect. With region proposals generated by a Bottom-Up Attention visual backbone, each phrase in Flickr30k Entities has an average of 4.75 gold labels; detailed statistics are presented in Figure 2b. To address this problem, we adopt the soft-label target distribution proposed in prior work, and our experiments show that models trained with this regime significantly outperform those trained with a one-hot target regime.
To combine the benefits brought by structured prediction from CRFs and by soft-label training regime, we define Soft-Label Chain CRFs, a variation of standard chain CRFs that allows us to work with gold label multiplicity. We adapt learning and inference algorithms from chain CRFs and develop an end-to-end training algorithm for our proposed model.
We evaluate the effectiveness of the Soft-Label Chain CRF on phrase grounding by conducting experiments on the Flickr30k Entities dataset and comparing grounding accuracy with strong baseline models, as well as with existing structured prediction methods and current state-of-the-art models. Experimental results show that our Soft-Label Chain CRF model outperforms its hard-label CRF counterpart by 2.43%, a vanilla non-CRF soft-label model by 0.40%, and the previous best results by about 1.4%, demonstrating that both of our contributions, modeling phrase grounding as a sequence labeling task and training with soft-label targets, matter for this task.

Related Work
Phrase Grounding. The phrase grounding task was first postulated by Karpathy and Fei-Fei (2017), among others, moving from holistic image captioning to the finer-grained task of matching regions with phrases in the caption. Datasets for this task include Flickr30k Entities (Plummer et al., 2017b), RefCOCO (Yu et al., 2016), and Visual Genome (Krishna et al., 2017). The general proposal-generation-ranking framework has been adopted by most approaches to phrase grounding, and research in this area has focused on improving specific components of this framework. Our work can be viewed as an improvement to the training and prediction aspects.

Structured Prediction in Phrase Grounding. We summarize some works that consider entity dependencies through structured prediction. Structured Matching (Wang et al., 2016) formulates phrase grounding as a bipartite matching process between phrases and candidate regions, and encourages the spatial relationship between two grounding regions to conform to an extracted partial coreference relation between their corresponding phrases. The resulting discrete optimization problem is then relaxed into a linear program to enable end-to-end training. Phrase-Region CCA (Plummer et al., 2017a) mines frequent patterns of semantically related paired phrases and trains a separate model for each pattern. The addition of this pairwise score makes the optimization a quadratic programming problem that requires approximate inference. QRC Net assumes that phrases in a caption refer to distinct entities, and thus predicted grounding regions are penalized for spatial overlap. However, overlapping regions can be penalized only after prediction, so this loss is not differentiable, and one has to resort to reinforcement learning. In these works, partial coreference extraction, frequent pattern mining, and spatial overlap penalties capture entity dependencies in an ad hoc manner, whereas we aim to cover the spectrum of such dependencies in a uniform way.
Soft-Label Training Regime. Conventionally, region proposal ranking is done by predicting a probability distribution over all candidate regions for grounding a given entity phrase, which is learned to match a target distribution. Rohrbach et al. (2016), among others, define the target distribution as a one-hot vector that gives credit only to the candidate region with the highest IoU with the gold region, and use cross-entropy loss as the training objective. Under this hard-label training regime, the model is trained to pick only the best candidate region while rejecting all other candidate regions, which is intuitively not a good behavior. Later work proposes a soft-label target distribution that gives weighted credit to all good candidate regions (i.e., those with above-threshold IoU with the gold region), and uses Kullback-Leibler (KL) divergence loss as the training objective.

Conditional Random Fields. CRFs (Lafferty et al., 2001) are discriminative probabilistic models that have proved useful in sequence labeling tasks by capturing label dependencies (Ma and Hovy, 2016; Lample et al., 2016). We summarize some works relevant to CRFs learned in soft-label or multi-label settings. Multi-CRFs (Dredze et al., 2009) learn CRFs from noisily annotated data, where annotators may disagree on the label for an input token. The assumption is that there is always exactly one gold label for each token, so the model favors a single label while conforming to the prior distribution of labels set by the annotators. To work with soft-label targets, it employs a mode-seeking, exclusive KL divergence definition, which does not imply moment matching, a desired property of CRFs (and, in general, exponential family models) that we show in Sections 3.1 and 3.2 for the mean-seeking, inclusive KL divergence definition in our model. Rodrigues et al. (2014) model the latent reliability of individual annotators, and use this information to guide the selection of trustworthy annotation sources and the estimation of the real gold labels. Note that both works assume one gold label per input token, where the ambiguity comes from the unreliability of annotations, while our work focuses on cases where there may be multiple gold labels per input token by the nature of the task.

Soft-Label Chain CRF
CRFs model the probability of a label sequence $y = y_{1:T}$ conditioned on an input sequence $x = x_{1:T}$ in terms of a score function $s(x, y)$:

$$p(y|x) = \frac{\exp s(x, y)}{Z(x)}, \qquad Z(x) = \sum_{y'} \exp s(x, y').$$

For a given training example $(x, y)$, the negative log-likelihood loss (i.e., the cross-entropy loss w.r.t. a one-hot target distribution that gives credit to the gold label only) is

$$L = -s(x, y) + \log Z(x).$$

The gradient of this loss w.r.t. the score function is

$$\frac{\partial L}{\partial s(x, y')} = -\mathbb{I}(y' = y) + p(y'|x),$$

which is known as moment matching. This allows us to train CRFs with gradient methods and conveniently connect to backpropagation when the score function is modeled by a neural architecture.
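The moment-matching gradient can be illustrated on a toy, explicitly enumerated label space (in practice the sequence space is exponentially large and one uses the factorized computation of Section 3.2; the function name here is ours):

```python
import numpy as np

def crf_nll_and_grad(scores, gold):
    """Negative log-likelihood of a CRF over an explicitly enumerated
    label space, where scores[i] = s(x, y_i) and gold indexes the gold
    label. The gradient w.r.t. the scores is p(y'|x) - I(y' = y_gold),
    i.e. moment matching."""
    log_Z = np.logaddexp.reduce(scores)
    p = np.exp(scores - log_Z)          # model distribution p(y'|x)
    loss = -scores[gold] + log_Z        # -s(x, y) + log Z(x)
    grad = p.copy()
    grad[gold] -= 1.0                   # p(y'|x) - I(y' = y)
    return loss, grad
```

Because the gradient is a difference of two probability distributions' masses, its entries sum to zero, which is easy to verify numerically.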

Soft-Label CRF
In the standard CRF above, each input $x_t$ corresponds to a single gold label $y_t$. To account for gold label multiplicity at training time, we replace the sequence of gold labels $y$ with a sequence of distributions $q = q_{1:T}$, where $q_t \in \mathbb{R}^K$ is the gold label distribution over all $K$ possible labels for input $x_t$. Note that this distribution should not be interpreted as the confidence of each label being correct; rather, it should be understood as a probabilistic gold label model: if we randomly choose a gold label, how likely is each label to be selected. Under an independence assumption, the gold probability of an arbitrary label sequence $y$ is

$$q(y|x) = \prod_{t=1}^{T} q_t(y_t).$$

It is easy to see that $q(y|x)$ is a distribution:

$$\sum_{y} q(y|x) = \prod_{t=1}^{T} \sum_{y_t} q_t(y_t) = 1.$$

Our goal is to learn this target distribution. Since the target distribution is no longer degenerate, we use the Kullback-Leibler (KL) divergence to measure the discrepancy between the model and the target distribution. Our training objective is the KL divergence loss (in its mean-seeking, inclusive form):

$$L = \mathrm{KL}(q \,\|\, p) = \sum_{y} q(y|x) \log \frac{q(y|x)}{p(y|x)},$$

which also gives gradients that exhibit moment matching:

$$\frac{\partial L}{\partial s(x, y')} = -q(y'|x) + p(y'|x).$$

Note that if we had defined the KL divergence loss in its mode-seeking, exclusive form $\sum_{y} p(y|x) \log \frac{p(y|x)}{q(y|x)}$, we would have lost this desired moment-matching property.
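A sketch of the inclusive KL loss and its moment-matching gradient $p - q$, again over a toy explicitly enumerated label space (function name ours; the target $q$ is assumed strictly positive for simplicity):

```python
import numpy as np

def soft_label_kl_and_grad(scores, q):
    """Inclusive KL(q || p) with p = softmax(scores) over an explicitly
    enumerated label space. The gradient w.r.t. the scores is p - q,
    so the moment-matching property of the one-hot case carries over."""
    log_Z = np.logaddexp.reduce(scores)
    log_p = scores - log_Z
    loss = float(np.sum(q * (np.log(q) - log_p)))
    grad = np.exp(log_p) - q            # p(y'|x) - q(y'|x)
    return loss, grad
```

The loss is zero exactly when the model distribution matches the target, and positive otherwise, as expected of a KL divergence.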

Factorization of Soft-Label Chain CRF
Learning CRFs over general graphs requires inference over cliques, which is usually computationally intractable. By restricting ourselves to local, pairwise potentials, we reduce the model to a first-order linear-chain CRF, whose score function factorizes as

$$s(x, y) = \sum_{t=1}^{T} \varepsilon(y_t, x) + \sum_{t=2}^{T} \tau(y_t, y_{t-1}, x),$$

where $\tau(\cdot, \cdot, \cdot)$ is the transition score between the labels at $t-1$ and $t$ that captures the dependency between labels for adjacent input tokens, and $\varepsilon(\cdot, \cdot)$ is the emission score between the label and the input at $t$.
Combining this factorization with soft-label targets gives the formal definition of the Soft-Label Chain CRF. Reorganizing the sums of the KL divergence loss by score terms, the loss can be written as

$$L = \sum_{t} \sum_{y_t} q(y_t|x) \log q(y_t|x) - \sum_{t} \sum_{y_t} q(y_t|x)\, \varepsilon(y_t, x) - \sum_{t} \sum_{y_t, y_{t-1}} q(y_t, y_{t-1}|x)\, \tau(y_t, y_{t-1}, x) + \log Z(x),$$

which gives moment-matching gradients

$$\frac{\partial L}{\partial \varepsilon(y_t, x)} = -q(y_t|x) + p(y_t|x), \qquad \frac{\partial L}{\partial \tau(y_t, y_{t-1}, x)} = -q(y_t, y_{t-1}|x) + p(y_t, y_{t-1}|x),$$

where $q(y_t|x) = q_t(y_t)$ and $q(y_t, y_{t-1}|x) = q_t(y_t)\, q_{t-1}(y_{t-1})$ are the probabilities of the local label(s) marginalized over all possible non-local labels. The smoothed marginals $p(y_t|x)$ and $p(y_t, y_{t-1}|x)$ can be computed with the forward-backward algorithm.

Algorithm 1: Modified forward algorithm to compute the KL divergence loss for Soft-Label Chain CRFs.

As an Extension of Soft-Label Model
Note that if we omit all transition terms in the Soft-Label Chain CRF, the loss reduces to

$$L = \sum_{t} \sum_{y_t} q_t(y_t) \log \frac{q_t(y_t)}{p(y_t|x)},$$

which factorizes completely over time steps. This is as if each label were predicted independently under a soft-label training regime, which is exactly the KL divergence loss used in prior soft-label work. Therefore, our Soft-Label Chain CRF can be viewed as an extension of this soft-label discriminative model.

Modified Forward Algorithm
For chain CRFs, computing the loss requires only the forward algorithm, while computing the gradients requires a full forward-backward pass. It can be proved that backpropagation through the loss gives the same result as running forward-backward. This is a commonly used trick in modern deep learning frameworks that eliminates the need to implement the backward pass. Algorithm 1 presents a modified forward algorithm that computes the loss for the Soft-Label Chain CRF. In Sections 1 and 2 of the Supplementary Materials, we prove the correctness of this algorithm, and that its backpropagation is also equivalent to forward-backward.
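A compact NumPy sketch of the loss computation, expanding the KL divergence into its negative-entropy, expected-score, and log-partition terms. For brevity we use a single input-independent K×K transition matrix, unlike the contextualized transition scores of Section 4.2; the function name is ours:

```python
import numpy as np

def soft_label_chain_crf_loss(q, emit, trans):
    """KL(q || p) for a Soft-Label Chain CRF.
    q:     (T, K) per-position gold label distributions (rows sum to 1)
    emit:  (T, K) emission scores eps(y_t, x)
    trans: (K, K) transition scores trans[y_prev, y_curr]"""
    T, K = emit.shape
    # log Z(x) via the standard forward recursion, in log space
    alpha = emit[0].copy()
    for t in range(1, T):
        alpha = emit[t] + np.logaddexp.reduce(alpha[:, None] + trans, axis=0)
    log_Z = np.logaddexp.reduce(alpha)
    # expected score under the factorized target q (emission + transition)
    exp_score = float(np.sum(q * emit))
    for t in range(1, T):
        exp_score += float(q[t - 1] @ trans @ q[t])
    # negative entropy of q (constant w.r.t. the model parameters)
    neg_entropy = float(np.sum(np.where(q > 0,
                                        q * np.log(np.maximum(q, 1e-300)),
                                        0.0)))
    return neg_entropy - exp_score + log_Z
```

With all transitions zero, this reduces to a sum of independent per-position KL divergences (Section 3.3), so the loss vanishes when each $q_t$ equals the softmax of the emission scores.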

Task Formulation
We formulate phrase grounding as a sequence labeling task. Given an image $I$, a caption $[c_1 \dots c_L]$ where $c_l$ is a word token, and a set of non-overlapping noun phrase spans $[p_1 \dots p_T]$ where $p_t = (s_t, e_t)$ denotes that the $t$-th phrase covers tokens $c_{s_t}$ to $c_{e_t}$ (inclusive), we generate a set of region proposals $\{r^1 \dots r^K\}$, label each phrase with a candidate region, and refine that region with a bounding box regression.
Figure 3: Our model for phrase grounding as a sequence labeling task. The K×K transition score matrix is derived from the features of K region proposals. The T × K emission score matrix is derived from a joint representation of phrase-region pairs, which is fused from features of region proposals and T entity phrases. Bounding box regression is applied to the sequence of regions predicted by the CRF. Cyan dashed line: contextualized transition score prediction (Section 4.2).
Figure 4: Text feature extraction for phrases in a caption. Shaded regions are entity phrase spans; circles represent LSTM cells. For phrase t, the hidden states at its span boundaries are concatenated to form its text features p_t, which are used in fusion with region features. For the contextualized transition score between phrases t−1 and t, the hidden states at the boundaries of the context between them are concatenated into a context feature vector p_{t−1,t}, which can be further extended with the phrase features p_{t−1} and p_t as well as the global text features p_G.

Model Specification
Figure 3 outlines our phrase grounding model. K region proposals and their visual and spatial features are extracted from an object detection vision backbone. We feed the token embeddings of the caption into a bi-directional LSTM (Hochreiter and Schmidhuber, 1997), and then concatenate the forward hidden state at the ending boundary of the phrase with the backward hidden state at the starting boundary of the phrase (see Figure 4). This phrase representation captures context both preceding and following the phrase in the caption.
We use low-rank bilinear pooling (LRBP) (Kim et al., 2017) to fuse text and region features. Compared to simple concatenation, LRBP supports pairwise interaction between bimodal feature channels while keeping a reasonable computational overhead. Given a text feature vector $p_t \in \mathbb{R}^{d_{\text{text}}}$ and a region feature vector $r^k \in \mathbb{R}^{d_{\text{vis}}}$, LRBP fuses them into a joint representation $f^k_t \in \mathbb{R}^{d_{\text{joint}}}$:

$$f^k_t = P^\top \left( U^\top p_t \circ V^\top r^k \right) + b,$$

where $U \in \mathbb{R}^{d_{\text{text}} \times r}$, $V \in \mathbb{R}^{d_{\text{vis}} \times r}$, pooling matrix $P \in \mathbb{R}^{r \times d_{\text{joint}}}$, bias $b \in \mathbb{R}^{d_{\text{joint}}}$, and $\circ$ is the Hadamard (i.e., element-wise) product.

As discussed in Section 3.2, the CRF score function consists of emission and transition scores. The emission score $\varepsilon(r^k, p_t)$ models the compatibility between each phrase and each candidate region; we feed the joint representation to a single-layer feed-forward neural network:

$$\varepsilon(r^k, p_t) = \mathrm{FFN}(f^k_t).$$

The transition score $\tau(r^k, r^{k'}, p_{1:T})$ is modeled by a two-layer feed-forward neural network with a ReLU activation for the hidden layer:

$$\tau(r^k, r^{k'}, p_{1:T}) = \mathrm{FFN}(\sigma(\mathrm{FFN}([r^k \,\|\, r^{k'}]))).$$

To condition the transition scores on local and global context from the caption, we can extend the input $[r^k \,\|\, r^{k'}]$ with the following text features: the context in between the two phrases (feature vector $p_{t-1,t}$), context from the phrase features $p_{t-1}$ and $p_t$, and global context $p_G$.

One important difference between the standard use of CRFs for sequence labeling and our task is that our "labels" do not correspond to a fixed set of classes that can be predicted for any input, but are as specific to the particular input example as the sequences to be labeled themselves. Hence, our transition and emission scores do not depend on the (arbitrary) indices of the regions to be grounded, but on their visual and spatial features (as well as on their corresponding linguistic contexts). Finally, although our approach could in principle be extended to higher-order CRFs, we restrict our attention here to first-order CRFs for computational efficiency. As a consequence, our models can only capture dependencies between string-adjacent phrases.
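A minimal NumPy sketch of the LRBP fusion, with shapes following the definitions of U, V, P, and b above. Any nonlinearity around the low-rank projections (Kim et al. (2017) use tanh) is omitted here for simplicity, and the toy dimensions are not the paper's:

```python
import numpy as np

def lrbp_fuse(p_t, r_k, U, V, P, b):
    """f = P^T (U^T p_t o V^T r_k) + b, with o the Hadamard product.
    U: (d_text, r), V: (d_vis, r), P: (r, d_joint), b: (d_joint,)."""
    return P.T @ ((U.T @ p_t) * (V.T @ r_k)) + b

rng = np.random.default_rng(0)
d_text, d_vis, rank, d_joint = 8, 6, 4, 5   # toy sizes, not the paper's
U = rng.normal(size=(d_text, rank))
V = rng.normal(size=(d_vis, rank))
P = rng.normal(size=(rank, d_joint))
b = rng.normal(size=d_joint)
p_t = rng.normal(size=d_text)   # phrase text features (toy)
r_k = rng.normal(size=d_vis)    # region features (toy)
f = lrbp_fuse(p_t, r_k, U, V, P, b)
```

Without a nonlinearity the map is bilinear in its two inputs (up to the bias), which is the "pairwise interaction between bimodal feature channels" that plain concatenation lacks.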

Training Objectives
For each image-caption instance, the loss is a linear combination of the labeling loss and the bounding box regression loss:

$$L = L_{\text{label}} + \gamma L_{\text{reg}},$$

where $L_{\text{label}}$ is the CRF loss defined in Section 3.2, and $L_{\text{reg}}$ is the smooth-L1 regression loss of Ren et al. (2017), computed between the predicted regression parameters and the ground-truth regression parameterization $(t_x, t_y, t_w, t_h)$ of the gold box relative to the predicted candidate region:

$$t_x = \frac{x - x_r}{w_r}, \quad t_y = \frac{y - y_r}{h_r}, \quad t_w = \log \frac{w}{w_r}, \quad t_h = \log \frac{h}{h_r},$$

where $(x, y, w, h)$ are the center coordinates and size of the gold box and $(x_r, y_r, w_r, h_r)$ those of the candidate region.
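The two regression components can be sketched as follows, using the Faster R-CNN parameterization of Ren et al. (2017); the function names are ours:

```python
import numpy as np

def regression_targets(proposal, gold):
    """Ground-truth regression parameterization (Ren et al., 2017).
    Boxes are (x_center, y_center, w, h); the targets describe how to
    shift and rescale the proposal box onto the gold box."""
    xp, yp, wp, hp = proposal
    xg, yg, wg, hg = gold
    return np.array([(xg - xp) / wp, (yg - yp) / hp,
                     np.log(wg / wp), np.log(hg / hp)])

def smooth_l1(x):
    """Elementwise smooth-L1 (Huber) loss used for L_reg:
    0.5 x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)
```

When the proposal already coincides with the gold box, the targets are all zero and the regression loss vanishes.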

Experiment Setup
Dataset. We train and evaluate our models on the Flickr30k Entities dataset (Plummer et al., 2017b), which contains 31,783 images, each accompanied by 5 captions. In keeping with previous work on this dataset, we assume that entity phrase boundaries are given, so inferring which phrases to ground is not part of our task. Following prior work, we merge all regions that are grounded to the same phrase into one larger bounding box, and split the dataset into 29,783 training images, 1k validation images, and 1k test images.
We do not apply our method to RefCOCO (Yu et al., 2016) or Visual Genome (Krishna et al., 2017) because they consist of independently grounded entity phrases, without the entity dependencies that CRFs could leverage.

Implementation details. For text feature extraction, we use the 1024-d contextualized word embeddings from the last layer of ELMo (Peters et al., 2018), followed by a bi-directional LSTM (Hochreiter and Schmidhuber, 1997) encoder with hidden dimension d_hidden = 512 for each direction, so that the text feature vector has dimension d_text = 1024. We use the Bottom-Up Attention model to generate region proposals and extract visual features, as in the state-of-the-art BAN and DDPN models. K = 100 region proposals are generated for each image. Each candidate region with coordinates (x_min, y_min), (x_max, y_max) is represented by a d_vis = 2053 feature vector that consists of 2048-d visual features concatenated with the 5-d spatial features [x_min/W, y_min/H, x_max/W, y_max/H, wh/(WH)]. The low-rank bilinear pooling (LRBP) layer used for text-region bimodal feature fusion has rank r = 1024 and output dimension d_joint = 1024. We train with a mini-batch size of 16 image-caption instances; each instance contains all entity phrases to be grounded in the caption. Weights are initialized with Xavier initialization (Glorot and Bengio, 2010). We apply a dropout rate of p = 0.2 after the word embedding, LSTM, and LRBP fusion layers. The loss weighting parameter γ is 10.0. All gradients are clipped to an ∞-norm of 10.0 to prevent gradient explosion. We do not fine-tune ELMo or the Bottom-Up Attention model. All models are trained for 50k iterations using Adam (Kingma and Ba, 2015) with learning rate 5e-5, β_1 = 0.9, and β_2 = 0.98. Model snapshots are taken every 5k iterations, and the model with the highest validation set accuracy is selected.

Metrics. We predict one grounded region for each entity phrase. Following Plummer et al. (2017b), a prediction is deemed accurate if it has at least 0.5 IoU overlap with the gold region. We report the percentage of accurately grounded phrases.
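As a concrete sketch, the 5-d spatial feature vector described above (normalized corners plus normalized area) can be computed as follows; the function name is ours:

```python
import numpy as np

def spatial_features(box, W, H):
    """5-d spatial features for a candidate region, as in the
    implementation details: [x_min/W, y_min/H, x_max/W, y_max/H, wh/(WH)].
    box is (x_min, y_min, x_max, y_max); W, H are the image size."""
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    return np.array([x_min / W, y_min / H, x_max / W, y_max / H,
                     (w * h) / (W * H)])
```

These features are concatenated with the 2048-d visual features to form the d_vis = 2053 region representation.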

Quantitative Results
We compare our Soft-Label Chain CRF model against three baselines: a Hard-Label non-CRF model, a Hard-Label Chain CRF, and a Soft-Label non-CRF model. The non-CRF models ground each phrase independently with a log-linear model. The Hard-Label models are trained with the standard one-hot training regime; the Soft-Label models use the soft-label training regime described above. The Soft-Label non-CRF model corresponds to the reduced form of the Soft-Label Chain CRF given in Section 3.3.

Table 1 shows the performance of previous structured prediction models, current state-of-the-art models, our baseline models, and the Soft-Label Chain CRF model. For a fair comparison with BAN, we also report the result of the hard-label baseline with GloVe (Pennington et al., 2014) embeddings, while we obtain a 0.33% higher result with ELMo. Training a non-CRF model on soft-label target distributions improves accuracy by a further 2.08%. On top of that, the Soft-Label Chain CRF improves accuracy by another 0.40%, which shows the effectiveness of treating phrase grounding as a sequence labeling task and using CRFs to capture entity dependencies. We also observe that the Hard-Label Chain CRF outperforms the hard-label baseline by a mere 0.05%, so our conjecture is that chain CRFs work well only with a suitable choice of training regime. The Soft-Label Chain CRF gives an overall improvement of 2.48% over the hard-label baseline; it significantly outperforms previous structured prediction models including Structured Matching (Wang et al., 2016), Phrase-Region CCA (Plummer et al., 2017a), and QRC Net, and surpasses the state-of-the-art BAN and DDPN models by margins of 5.00% and about 1.4%, respectively. We conduct an ablation study to find the most appropriate combination of context features for the transition scores in the Soft-Label Chain CRF model. Table 2 shows that we obtain the best results by including the context in between the two phrases, which gives an improvement of 0.41%. We did not see any benefit from adding further text features from the left and right sides of the phrases, or from the entire caption.

Table 1: Performance of different phrase grounding methods on Flickr30k Entities (test set). Our CRF models have transition scores conditioned on features of the context in between the two phrases ("M" in Table 2). Our methods, unless explicitly specified, use ELMo (Peters et al., 2018) word embeddings.
Besides the Viterbi decoding algorithm normally used for prediction in CRFs, we also experiment with a smoothing decoding algorithm. While Viterbi finds the MAP label sequence conditioned on the input sequence, $\arg\max_y p(y|x)$, smoothing decoding finds the best label for each input $x_t$ individually: $\arg\max_{y_t} p(y_t|x)$. This makes sense in scenarios where we want to refine the predicted grounding of one entity by referring to the context, instead of attempting to ground all entities mentioned in the description. Table 3 shows that in both the Hard-Label Chain CRF and the Soft-Label Chain CRF, smoothing decoding gives a prediction accuracy 0.04% higher than Viterbi decoding.
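The two decoding strategies can be sketched as follows (NumPy, log space; as before, transition scores are simplified to a single input-independent K×K matrix, and function names are ours):

```python
import numpy as np

def viterbi_decode(emit, trans):
    """MAP label sequence argmax_y p(y|x) for a linear-chain CRF.
    emit: (T, K) emission scores; trans: (K, K) transition scores."""
    T, K = emit.shape
    delta = emit[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = delta[:, None] + trans        # (K_prev, K_curr)
        back[t] = cand.argmax(axis=0)
        delta = emit[t] + cand.max(axis=0)
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def smoothing_decode(emit, trans):
    """Per-position argmax of the marginals p(y_t|x), via forward-backward."""
    T, K = emit.shape
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = emit[0]
    for t in range(1, T):
        alpha[t] = emit[t] + np.logaddexp.reduce(
            alpha[t - 1][:, None] + trans, axis=0)
    for t in range(T - 2, -1, -1):
        beta[t] = np.logaddexp.reduce(
            trans + (emit[t + 1] + beta[t + 1])[None, :], axis=1)
    return list((alpha + beta).argmax(axis=1))
```

With all transitions zero, both strategies reduce to the per-position argmax of the emission scores; with non-trivial transitions they can disagree, which is exactly the 0.04% gap probed in Table 3.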
Without bounding box regression, the Soft-Label Chain CRF model has an accuracy of 69.85%, a 4.84% reduction compared to the setting with bounding box regression.

Qualitative Results
We visualize some phrase grounding results on the validation set of Flickr30k Entities in Figure 5. In (a), our CRF model avoids the error in grounding "a lounge chair" by constraining its relative position to "a man". In (b), although it may not have learned to distinguish "headband" from "hat", the CRF constrains the spatial position of "headband" to agree with the ownership dependency provided in context. In (c), it avoids the error in grounding "skirt" by spatially discriminating it from "a blouse". In (d), it avoids the error in grounding "a cleanser" by constraining its relative size w.r.t. "a child". These examples indicate that the CRF model can avoid grounding errors made by non-CRF models by leveraging entity dependencies, including relative position, spatial overlap, and relative size.

Conclusion
In this paper, we formulate phrase grounding as a sequence labeling task and propose the Soft-Label Chain CRF model, which combines the benefits of global structured prediction with a soft-label training regime that addresses the gold label multiplicity problem. Experimental results show that we achieve an overall improvement of 2.48% in grounding accuracy over a strong baseline, and that our model outperforms previous methods on phrase grounding.