Structured Prediction with Output Embeddings for Semantic Image Annotation

We address the task of annotating images with semantic tuples. Solving this problem requires an algorithm that can handle hundreds of classes for each argument of the tuple. In such contexts, data sparsity becomes a key challenge, as there will be a large number of classes for which only a few examples are available. We propose handling this by incorporating feature representations of both the inputs (images) and outputs (argument classes) into a factorized log-linear model, and exploiting the flexibility of scoring functions based on bilinear forms. Experiments show that integrating feature representations of the outputs in the structured prediction model leads to better overall predictions. We also conclude that the best output representation is specific to each type of argument.


Introduction
Many important problems in machine learning can be framed as structured prediction tasks where the goal is to learn functions that map inputs to structured outputs such as sequences, trees or general graphs. A wide range of applications involve learning over large state spaces, e.g., if the output is a labeled graph, each node of the graph may take values over a potentially large set of labels. Data sparsity then becomes a challenge, as there will be many classes with very few training examples.
Within this context, we are interested in the task of predicting semantic tuples for images. That is, given an input image we seek to predict the events or actions taking place (referred to here as predicates), who and what the participants of the actions are (referred to here as actors), and where the action is taking place (referred to here as locatives). For example, an image might be annotated with the semantic tuples ⟨run, dog, park⟩ and ⟨play, dog, grass⟩. We call each field of a tuple an argument.
To handle the data sparsity challenge imposed by the large state space, we will leverage an approach that has proven to be useful in multiclass and multilabel prediction tasks (Weston et al., 2010; Akata et al., 2013). The idea is to represent a value for an argument $a$ using a feature vector representation $\phi \in \mathbb{R}^n$. We will integrate this argument representation into the structured prediction model.
In summary, our main contribution is to propose an approach that incorporates feature representations of the outputs into a structured prediction model, and apply it to the problem of annotating images with semantic tuples. We present an experimental study using different output feature representations and analyze how they affect performance for different argument types.

Semantic Tuple Image Annotation
Task: We will address the task of predicting semantic tuples for images. Following Farhadi et al. (2010), we will focus on a simple semantic representation that considers three basic arguments: predicate, actors and locatives. For example, in the tuple ⟨play, dog, grass⟩, "play" is the predicate, "dog" is the actor and "grass" is the locative.
Given this representation, we can formally define our problem as that of learning a function $\theta : X \times P \times A \times L \to \mathbb{R}$ that scores the compatibility between images and semantic tuples. Here $X$ is the space of images; $P$, $A$ and $L$ are discrete sets of predicate, actor and locative arguments respectively, and $\langle p, a, l \rangle$ is a specific tuple instance. The overall learning process is illustrated in Fig. 1.
Dataset: For our experiments we used a subset of the Flickr8k dataset, proposed in Hodosh et al. (2013). This dataset (subset B in Fig. 1) consists of 8,000 images from Flickr of people and animals (mostly dogs) performing some action, with five crowd-sourced descriptive captions for each one.
Figure 1: Overview of our approach. First, images $x \in A$ are represented using image features $\phi_s(x)$, and semantic tuples are obtained by applying our semantic tuple extractor (learned from subset C) to their corresponding captions. The resulting enlarged training set is used to train our embedded CRF model, which maps images to semantic tuples. [Panel labels: "Training Data"; "Embedded CRF (implicitly induces embedding of image features and arguments)".]
We first manually annotated 1,544 captions, corresponding to 311 images (approximately one third of the development set; subset C in Fig. 1), producing more than 2,000 semantic tuples of predicate, actor and locative. For the experiments we partitioned the images and annotations into training, validation and test sets of 150, 50 and 100 images respectively.
Data augmentation: To enlarge the manually annotated dataset we trained a model able to predict semantic tuples from captions using standard shallow and deep linguistic features (e.g., POS tags, dependency parsing, semantic role labeling). We extract the predicates by looking at the words tagged as verbs by the POS tagger. Then, the extraction of arguments for each predicate is resolved as a classification problem.
More specifically, for each detected predicate in a sentence we regard each noun as a positive or negative training example of a given relation, depending on whether or not the candidate noun is an argument of the predicate. We use these examples to train an SVM classifier that predicts whether a candidate noun is an argument of a given predicate, based on several linguistic features computed over the syntactic path of the dependency tree that connects them. We run the learned tuple predictor on all the captions of the Flickr8k dataset to obtain a larger dataset of 8,000 images paired with semantic tuples.
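The construction of training examples for the argument classifier can be illustrated with a minimal sketch. It assumes captions are already POS-tagged with Penn Treebank style tags (verbs start with "VB", nouns with "NN"); the function name `candidate_examples` and the input format are hypothetical, and the syntactic-path features fed to the SVM are not shown because the text does not specify them.

```python
def candidate_examples(tagged_tokens, gold_arguments):
    """Build (predicate, noun, label) examples for one caption.

    tagged_tokens  : list of (word, pos_tag) pairs (assumed Penn Treebank tags).
    gold_arguments : set of (predicate, noun) pairs annotated as true arguments.
    Label is +1 if the noun is an argument of the predicate, -1 otherwise.
    """
    # Predicates are the words tagged as verbs; candidates are all nouns.
    predicates = [w for w, tag in tagged_tokens if tag.startswith("VB")]
    nouns = [w for w, tag in tagged_tokens if tag.startswith("NN")]
    return [(p, n, 1 if (p, n) in gold_arguments else -1)
            for p in predicates for n in nouns]


caption = [("a", "DT"), ("dog", "NN"), ("chases", "VBZ"), ("a", "DT"),
           ("ball", "NN"), ("while", "IN"), ("a", "DT"), ("boy", "NN"),
           ("watches", "VBZ")]
gold = {("chases", "dog"), ("chases", "ball"), ("watches", "boy")}
examples = candidate_examples(caption, gold)
```

Each example would then be turned into a feature vector and passed to a binary classifier.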

Bilinear Models with Output Features
In this section we explain how we incorporate output feature representations into a factorized linear model. For simplicity, we will consider factorized sequence models over sequences of fixed length. However, it should not be hard to see that all the ideas presented here can be easily generalized to other structured prediction settings.
Let $y = [y_1, \ldots, y_T]$ be a sequence of labels and $S = [S_1, \ldots, S_T]$ be the sets of possible label values, where $y_i \in S_i$. We are interested in learning a model that computes $P(y \mid x)$, i.e., the conditional probability of a sequence $y$ given some input $x$. We will consider factorized log-linear models that take the form:
$$P(y \mid x) = \frac{\exp \theta(x, y)}{\sum_{y'} \exp \theta(x, y')} \qquad (1)$$
The scoring function $\theta(x, y)$ is modeled as a sum of unary and binary bilinear potentials and is defined as:
$$\theta(x, y) = \sum_{t=1}^{T} v_{y_t}^\top W_t\, \phi(x, t) + \sum_{t=1}^{T-1} v_{y_t}^\top Z_t\, v_{y_{t+1}} \qquad (2)$$
where $v_{y_t} \in \mathbb{R}^{n_t}$ is an $n_t$-dimensional feature representation of label argument $y_t \in S_t$ and $\phi(x, t) \in \mathbb{R}^{d_t}$ is a $d_t$-dimensional feature representation of the $t$-th input factor of $x$.
The first set of terms in the above equation are usually referred to as unary potentials and measure the compatibility between a single state at $t$ and the feature representation of input factor $t$. The second set of terms are the binary potentials and measure the compatibility between pairs of states at adjacent factors. The scoring function $\theta(x, y)$ is fully parameterized by the unary parameter matrices $W_t \in \mathbb{R}^{n_t \times d_t}$ and the binary parameter matrices $Z_t \in \mathbb{R}^{n_t \times n_t}$.
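As an illustration, the following minimal NumPy sketch evaluates a score built from unary bilinear terms (label vector, unary matrix, input features) and binary bilinear terms (label vector, binary matrix, next label vector), and normalizes it by brute-force enumeration. The function names and the container layout (`V`, `W`, `Z`, `phi` as Python lists indexed by position) are assumptions made for this example. The binary matrices are given shape (n_t, n_{t+1}), which coincides with n_t × n_t when all positions share the same label dimension. Enumeration is only feasible for tiny label sets and is not how a chain model would be normalized in practice.

```python
from itertools import product

import numpy as np


def chain_score(V, W, Z, phi, y):
    """Score of label sequence y for one input.

    V[t]   : (|S_t|, n_t) array; row k is the feature vector of label k at position t.
    W[t]   : (n_t, d_t) unary parameter matrix.
    Z[t]   : (n_t, n_{t+1}) binary parameter matrix linking positions t and t+1.
    phi[t] : (d_t,) feature vector of the t-th input factor.
    y      : sequence of label indices, one per position.
    """
    T = len(y)
    unary = sum(V[t][y[t]] @ W[t] @ phi[t] for t in range(T))
    binary = sum(V[t][y[t]] @ Z[t] @ V[t + 1][y[t + 1]] for t in range(T - 1))
    return float(unary + binary)


def conditional_prob(V, W, Z, phi, y):
    """P(y | x) of the log-linear model, normalized by enumerating all sequences."""
    candidates = list(product(*[range(len(Vt)) for Vt in V]))
    scores = np.array([chain_score(V, W, Z, phi, c) for c in candidates])
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_norm = scores.max() + np.log(np.exp(scores - scores.max()).sum())
    return float(np.exp(chain_score(V, W, Z, phi, y) - log_norm))
```

Setting each `V[t]` to an identity matrix recovers indicator label vectors, in which case `W[t]` and `Z[t]` hold one free parameter per label and per label pair.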
The main idea is to define a feature space where semantically similar labels will be close. As in the multilabel scenario (Weston et al., 2010; Akata et al., 2013), having full feature representations for arguments will allow us to share information across different classes and generalize better. With a good output feature representation, our model should be able to make sensible predictions about pairs of arguments that it has not observed at training. This is easy to see: consider a case where we have a pair of arguments represented with feature vectors $a_1$ and $a_2$, and suppose that we have not observed the factor $\langle a_1, a_2 \rangle$ in our training data but we have observed the factor $\langle b_1, b_2 \rangle$. Then if $a_1$ is close in the feature space to argument $b_1$ and $a_2$ is close to $b_2$, our model will predict that $a_1$ and $a_2$ are compatible. That is, it will assign probability to the factor $\langle a_1, a_2 \rangle$, which seems a natural generalization from the observed training data.
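This argument can be made quantitative with a short bound, given here as an illustrative sketch rather than as part of the original analysis. Write the binary potential of a pair as $a_1^\top Z a_2$, and let $\|Z\|_2$ denote the spectral norm of $Z$ and $\|\cdot\|$ the Euclidean norm (notation introduced for this example). Adding and subtracting $b_1^\top Z a_2$ gives

$$a_1^\top Z a_2 - b_1^\top Z b_2 = (a_1 - b_1)^\top Z a_2 + b_1^\top Z (a_2 - b_2),$$

and therefore

$$\left| a_1^\top Z a_2 - b_1^\top Z b_2 \right| \le \|Z\|_2 \left( \|a_1 - b_1\| \, \|a_2\| + \|b_1\| \, \|a_2 - b_2\| \right).$$

So when $a_1$ is close to $b_1$ and $a_2$ is close to $b_2$, the score of the unobserved pair is close to the score learned for the observed pair. With indicator vectors, distinct labels are always at distance $\sqrt{2}$, so no such transfer occurs.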
Now we show that the ranks of $W$ and $Z$ have useful interpretations. Let $W = U \Sigma V^\top$ be the singular value decomposition of $W$. We can then write the unary potentials $v_y^\top W \phi(x, t)$ as:
$$v_y^\top W \phi(x, t) = [v_y^\top U]\, \Sigma\, [V^\top \phi(x, t)]$$
Thus we can regard the bilinear form as a function computing a weighted inner product between a real embedding $v_y^\top U$ representing state $y$ and a real embedding $V^\top \phi(x, t)$ representing input factor $t$. The rank of $W$ gives us the intrinsic dimensionality of the embedding. Thus, if we want to induce shared low-dimensional embeddings across different states, it seems reasonable to impose a low-rank penalty on $W$. Similarly, let $Z = U \Sigma V^\top$ now be the singular value decomposition of $Z$. We can write the binary potentials $v_y^\top Z v_{y'}$ as:
$$v_y^\top Z v_{y'} = [v_y^\top U]\, \Sigma\, [V^\top v_{y'}]$$
and thus the binary potentials compute a weighted inner product between a real embedding of state $y$ and a real embedding of state $y'$. As before, the rank of $Z$ gives us the intrinsic dimensionality of the embedding and, to induce a low-dimensional embedding for binary potentials, we will impose a low-rank penalty on $Z$.
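The embedding view of a bilinear potential can be checked numerically. The sketch below, with illustrative dimensions and a hypothetical helper name `embedded_unary_score`, builds a rank-2 matrix, projects the label vector and the input features into the 2-dimensional space given by the SVD, and confirms that the weighted inner product of the two embeddings equals the original bilinear form.

```python
import numpy as np


def embedded_unary_score(W, v_y, phi_x, k):
    """Evaluate v_y^T W phi_x through k-dimensional embeddings from the SVD of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    label_emb = v_y @ U[:, :k]       # k-dimensional embedding of the label
    input_emb = Vt[:k] @ phi_x       # k-dimensional embedding of the input factor
    # Inner product of the two embeddings, weighted by the singular values.
    return float(np.sum(label_emb * s[:k] * input_emb))


rng = np.random.default_rng(0)
n, d, k = 6, 8, 2                                       # illustrative sizes
W = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))   # rank-k by construction
v_y = rng.normal(size=n)
phi_x = rng.normal(size=d)

direct = float(v_y @ W @ phi_x)
via_embeddings = embedded_unary_score(W, v_y, phi_x, k)
```

For a matrix of exact rank k the two values agree up to floating-point error; for a full-rank matrix, choosing a smaller k gives a low-rank approximation of the score.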
After having described the type of scoring functions we are interested in, we now turn our attention to the learning problem. That is, given a training set $D = \{\langle x, y \rangle\}$ of pairs of inputs $x$ and output sequences $y$, we need to learn the parameters $\{W\}$ and $\{Z\}$. For this purpose we will do standard maximum-likelihood estimation and find the parameters that minimize the conditional negative log-likelihood of the data in $D$. That is, we will find the $\{W\}$ and $\{Z\}$ that minimize the following loss function:
$$L(D, \{W\}, \{Z\}) = - \sum_{\langle x, y \rangle \in D} \log P(y \mid x; \{W\}, \{Z\})$$
Recall that we are interested in learning low-rank unary and binary potentials. To achieve this we take a common approach, which is to use the nuclear norms $\|W\|_*$ and $\|Z\|_*$ as convex approximations of the rank function. The final optimization problem becomes:
$$\min_{\{W\}, \{Z\}} \; L(D, \{W\}, \{Z\}) + \alpha \sum_t \|W_t\|_* + \beta \sum_t \|Z_t\|_* \qquad (3)$$
where $L(D, \{W\}, \{Z\})$ is the negative log-likelihood function and $\alpha$ and $\beta$ are two constants that control the trade-off between minimizing the loss and the implicit dimensionality of the embeddings. We use a simple optimization scheme known as Forward Backward Splitting, or FOBOS (Duchi and Singer, 2009).
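A forward-backward splitting step with a nuclear-norm regularizer can be sketched as follows. This is a minimal illustration, not the authors' implementation: the gradient of the negative log-likelihood is assumed to be computed elsewhere and passed in as `grad`, the step size is a free parameter, and the function names are hypothetical. The backward step uses the standard proximal operator of the nuclear norm, which soft-thresholds the singular values.

```python
import numpy as np


def prox_nuclear(M, tau):
    """Proximal operator of tau * ||.||_*: shrink each singular value by tau, floor at 0."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def fobos_step(W, grad, step, alpha):
    """One update: gradient step on the loss (forward), then nuclear-norm prox (backward)."""
    return prox_nuclear(W - step * grad, step * alpha)
```

Singular values that fall below the threshold are set exactly to zero, which is how the penalty lowers the rank of the iterates; larger values of the regularization constant give lower-rank matrices.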
For our task we will consider a simple factorized scoring function $\theta(x, \langle p, a, l \rangle)$ that has one factor associated with the locative-predicate pair and one factor associated with the predicate-actor pair. Since this corresponds to a chain structure, $\arg\max_{\langle p, a, l \rangle} \theta(x, \langle p, a, l \rangle)$ can be efficiently computed using Viterbi decoding in time $O(N^2)$, where $N = \max(|P|, |A|, |L|)$. Similarly, we can also find the top $k$ predictions in $O(kN^2)$. Thus for this application the scoring function of the bilinear CRF takes the form:
$$\theta(x, \langle p, a, l \rangle) = \lambda_{loc}(x)^\top W_{loc}\, \phi_{loc}(l) + \lambda_{pre}(x)^\top W_{pre}\, \phi_{pre}(p) + \lambda_{act}(x)^\top W_{act}\, \phi_{act}(a) + \phi_{loc}(l)^\top W^{loc}_{pre}\, \phi_{pre}(p) + \phi_{pre}(p)^\top W^{pre}_{act}\, \phi_{act}(a) \qquad (4)$$
Here $\lambda_{loc}(x)$, $\lambda_{pre}(x)$ and $\lambda_{act}(x)$ are the image representations and $\phi_{loc}(l)$, $\phi_{pre}(p)$ and $\phi_{act}(a)$ are the argument representations. The unary potentials measure the compatibility between an image and a semantic argument, the first binary potential measures the compatibility between a locative and a predicate, and the second binary potential measures the compatibility between a predicate and an actor. The scoring function is fully parameterized by the unary parameter matrices $W_{loc} \in \mathbb{R}^{d_l \times n_l}$, $W_{pre} \in \mathbb{R}^{d_p \times n_p}$ and $W_{act} \in \mathbb{R}^{d_a \times n_a}$ and the binary parameter matrices $W^{loc}_{pre} \in \mathbb{R}^{n_l \times n_p}$ and $W^{pre}_{act} \in \mathbb{R}^{n_p \times n_a}$, where $n_l$, $n_p$ and $n_a$ are the dimensionalities of the locative, predicate and actor feature representations respectively, and $d_l$, $d_p$ and $d_a$ are the dimensionalities of the image representations.
Notice that if we let the argument representation $\phi_t(r) \in \mathbb{R}^{|S_t|}$ be an indicator vector for label $r$ of argument $t$, we obtain the usual parametrization of a standard factorized linear model, while having dense feature representations for arguments instead of indicator vectors will allow us to share information across different classes.
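Decoding over the locative, predicate, actor chain can be sketched with a small max-sum (Viterbi) routine. The sketch assumes the potentials have already been evaluated for one image and stored as arrays: one unary score per label and one table of binary scores per adjacent pair. The function name and argument layout are hypothetical.

```python
import numpy as np


def decode_tuple(u_loc, u_pre, u_act, b_loc_pre, b_pre_act):
    """Best-scoring tuple on the chain locative - predicate - actor.

    u_loc, u_pre, u_act : unary scores, shapes (|L|,), (|P|,), (|A|,).
    b_loc_pre           : binary scores, shape (|L|, |P|).
    b_pre_act           : binary scores, shape (|P|, |A|).
    Returns (best_score, (p, a, l)) as label indices.
    """
    # Best locative for each predicate, with back-pointers.
    into_pre = u_loc[:, None] + b_loc_pre          # (|L|, |P|)
    best_loc = into_pre.argmax(axis=0)
    m_pre = into_pre.max(axis=0) + u_pre           # (|P|,)
    # Best predicate for each actor, with back-pointers.
    into_act = m_pre[:, None] + b_pre_act          # (|P|, |A|)
    best_pre = into_act.argmax(axis=0)
    m_act = into_act.max(axis=0) + u_act           # (|A|,)
    # Backtrack from the best actor.
    a = int(m_act.argmax())
    p = int(best_pre[a])
    l = int(best_loc[p])
    return float(m_act[a]), (p, a, l)
```

Each of the two maximization steps touches one table of pairwise scores, so the cost is quadratic in the largest label set rather than cubic as in exhaustive enumeration of all tuples.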

Representing Semantic Arguments
We will conduct experiments with two different feature representations: 1) a fully unsupervised Skip-Gram based Continuous Word Representation (SCWR) (Mikolov et al., 2013), and 2) a feature representation computed from the pairs of captions and semantic tuples, which we call the Semantic Equivalence Representation (SER).
We decided to exploit the dataset of captions paired with semantic tuples to induce a useful feature representation for arguments. The idea is quite simple: we wish to leverage the fact that any pair of semantic tuples associated with the same image will likely be describing the same event. Thus, they are in essence different ways of lexicalizing the same underlying concept. Let us look at a concrete example. Imagine that we have an image annotated with the tuples ⟨play, dog, water⟩ and ⟨play, dog, river⟩. Since both tuples describe the same image, it is quite likely that both "river" and "water" refer to the same real-world entity, i.e., "river" and "water" are 'semantically equivalent' for this image. Using this idea we can build a representation $\phi_{loc}(i) \in \mathbb{R}^{|L|}$ where the $j$-th dimension corresponds to the number of times the argument $j$ has been semantically equivalent to argument $i$. More precisely, we compute the probability that argument $j$ can be exchanged with argument $i$ as:
$$\frac{[i, j]_{sr}}{\sum_{j'} [i, j']_{sr}}$$
where $[i, j]_{sr}$ is the number of times that $i$ and $j$ have appeared as annotations of the same image and with the same other arguments. For example, for the actor arguments, $[i, j]_{sr}$ represents the number of times that actor $i$ and actor $j$ have appeared with the same locative and predicate as descriptions of the same image.
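The counting scheme can be illustrated with a short sketch. Two details are assumptions of this example rather than statements from the text: each distinct label is counted once per image and context (repeated identical tuples are ignored), and a label is counted as equivalent to itself, which keeps every row normalizable. The function name and the input format are hypothetical.

```python
from collections import defaultdict


def semantic_equivalence(image_tuples, slot):
    """Semantic equivalence representation for one argument type.

    image_tuples : dict image_id -> list of (predicate, actor, locative) tuples.
    slot         : argument to represent (0 = predicate, 1 = actor, 2 = locative).
    Returns dict label_i -> dict label_j -> probability (missing entries are 0).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for tuples in image_tuples.values():
        # Group labels of the chosen slot by the other two arguments.
        groups = defaultdict(set)
        for tup in tuples:
            context = tup[:slot] + tup[slot + 1:]
            groups[context].add(tup[slot])
        # Labels sharing an image and a context count as equivalent.
        for labels in groups.values():
            for i in labels:
                for j in labels:
                    counts[i][j] += 1
    # Normalize each row of counts into probabilities.
    rep = {}
    for i, row in counts.items():
        total = sum(row.values())
        rep[i] = {j: c / total for j, c in row.items()}
    return rep


data = {
    "img1": [("play", "dog", "water"), ("play", "dog", "river")],
    "img2": [("play", "dog", "water")],
}
loc_rep = semantic_equivalence(data, slot=2)
```

In this toy input "water" and "river" share an image, predicate and actor once, so each receives nonzero weight in the other's representation.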

Related Work
In recent years, some works have tackled the problem of generating rich textual descriptions of images. One of the pioneering works is Kulkarni et al. (2011), where a CRF model combines the output of several vision systems to produce input for a language generation method. In Farhadi et al. (2010), the authors find the similarity between sentences and images in a "meaning" space, represented by semantic tuples which are very similar to our triplets. Other works focus on a simplified problem: ranking of human-generated captions for images. Hodosh et al. (2013) propose to use Kernel Canonical Correlation Analysis to project images and their captions into a joint representation space, in which images and captions can be related and ranked to perform illustration and annotation tasks. Socher et al. (2014) also address the ranking of images given a sentence and vice versa, using a common subspace learned via Recursive Neural Networks. Other recent works also exploit deep networks to address the problem (Vinyals et al., 2015; Karpathy and Fei-Fei, 2015). Using label embeddings combined with bilinear forms has been previously proposed in the context of multiclass and multilabel image classification (Weston et al., 2010; Akata et al., 2013).

Experiments
For image features we use the 4,096-dimensional second-to-last layer of the BVLC implementation of the 'AlexNet' ImageNet model, a Convolutional Neural Network (CNN), as described in Jia et al. (2014). To test our method we used the 100 test images that were annotated with ground-truth semantic tuples. To measure performance we first predict the top tuple for each image and then measure accuracy for each argument type (i.e., the number of correct predictions among the top-1 triplets). The regularization parameters of each model were set using the validation set. We compare the performance of the following models:
1) Baseline Separate Predictors (S-Pred): We consider a baseline made of independent predictors for each argument type. More specifically, we train one-vs-all SVMs (we also tried multi-class SVMs but they did not improve performance) to independently predict locatives, predicates and actors. For each argument type and candidate label we have a score computed by the corresponding SVM. Given an image, we generate the top tuples that maximize the sum of scores for each argument type.
2) Baseline KCCA: This model implements the Kernel Canonical Correlation Analysis approach of Hodosh et al. (2013). We first note that this approach is able to rank a list of candidate captions but cannot directly generate tuples. To generate tuples for test images, we first find the caption in the training set that has the highest ranking score for that image and then extract the corresponding semantic tuples from that caption.
3) Indicator Features (IND): A standard factorized log-linear model that does not use any feature representation for the outputs.
4) SCWR: A model that uses the skip-gram continuous word representation of outputs.
5) SER: A model that uses the semantic equivalence representation of outputs.
6) COMBO: A combined model that makes predictions using the best feature representation for each argument type.
[Figure: example tuples, e.g., <act=dog, pre=run, loc=grass> and <act=man, pre=ride, loc=street>, with a panel labeled "Incorrect" containing, e.g., <act=people, pre=perform, loc=air>.]
Table 1 shows the results. We observe that our proposed method performs significantly better than the baselines.
The second observation is that the best performing output feature representation differs across argument types: for the locatives the best representation is SER, for the predicates it is SCWR, and for the actors using an output feature representation actually hurts performance. The biggest improvement is on the predicate arguments, where the skip-gram word representation improves average precision by almost 10% over the baseline. Overall, the model that uses the best representation for each argument type performs better than the indicator baseline.
Regarding the rank of the parameter matrices, we observed that the learned models can work well even if we drop the rank to 10% of its maximum value. This shows that the learned models are efficient in the sense that they can work well with low-dimensional projections of the features.
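The per-argument accuracy used in this section can be sketched as below. The matching rule is an assumption of this example: since an image may carry several ground-truth tuples, a predicted argument is counted as correct when it matches that argument in any of them. The function name and input format are hypothetical.

```python
def per_argument_accuracy(predicted, gold):
    """Top-1 accuracy per argument type.

    predicted : dict image_id -> (predicate, actor, locative), the top-1 tuple.
    gold      : dict image_id -> list of ground-truth (predicate, actor, locative).
    Returns accuracies in the order (predicate, actor, locative).
    """
    correct = [0, 0, 0]
    for image_id, pred in predicted.items():
        for slot in range(3):
            # Assumed rule: a match with any ground-truth tuple counts as correct.
            if any(pred[slot] == g[slot] for g in gold[image_id]):
                correct[slot] += 1
    n = len(predicted)
    return tuple(c / n for c in correct)


predicted = {1: ("run", "dog", "park"), 2: ("sit", "man", "pool")}
gold = {
    1: [("run", "dog", "grass"), ("play", "dog", "park")],
    2: [("stand", "man", "street")],
}
accuracy = per_argument_accuracy(predicted, gold)
```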

Conclusion
In this paper we have presented a framework for exploiting input and output embeddings in the context of structured prediction. We have applied this framework to the problem of predicting compositional semantic descriptions of images. Our results show the advantages of using output embeddings and inducing low-dimensional embeddings for handling large state spaces in structured prediction problems. The framework we propose is general enough to consider additional sources of information.