VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks

Text-to-image multimodal tasks, which generate or retrieve an image from a given text description, are extremely challenging because raw text descriptions carry too little information to fully describe visually realistic images. We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information about objects from the text input. First, we take the text description as the initial input and conduct dependency parsing to extract the syntactic structure and analyse the semantic aspects, including object quantities, in order to extract the scene graph. Then, we train the extracted objects, attributes, and relations in the scene graph, together with the corresponding geometric relation information, using Graph Convolutional Networks, which generates a text representation that integrates textual and visual semantic information. This text representation is aggregated with word-level and sentence-level embeddings to generate both visual contextual word and sentence representations. For the evaluation, we attached VICTR to state-of-the-art models in text-to-image generation. VICTR is easily added to existing models and improves them in both quantitative and qualitative aspects.


Introduction
Over the past decade, deep learning has achieved remarkable success in multimodal research problems, such as visual question answering, visual dialog, and image captioning. However, it is quite challenging to solve the text-to-vision multimodal tasks that deal with a text input and produce a visual output, such as text to image generation, text to video generation, or text to image retrieval. Generally, natural language (incl. text) is a more natural medium for a human to describe the image that they want to generate or retrieve. However, raw text includes only limited information to fully describe and represent an image.
For example, text to image generation tasks aim to generate photo-realistic images according to the given text descriptions. The current state-of-the-art (SOTA) text-to-image generation models (Xu et al., 2018; Zhang et al., 2017; Zhu et al., 2019) mainly focus on generating high-resolution images by applying generative adversarial networks (GAN) (Goodfellow et al., 2014) and rather neglect understanding the input text descriptions. The most common text encoding approach in those SOTA models applies Recurrent Neural Networks (RNN) to extract global sentence-level and word-level embeddings, and the last hidden state of the RNN cells is used as a direct input to the GAN image generation model. However, RNN-based text encoding approaches are not capable of fully representing the rich visual semantics of input text descriptions in order to describe or generate photo-realistic visual output (e.g. an image). For example, given the sample text caption "A man on a skateboard with a brown dog outside", the last hidden state of the RNN cells holds the semantics and order of the words and stores the words' importance in the sentence. However, this does not include most of the information required to describe or generate the image: what objects are in the image (aspect of objects), where those objects are (position of objects), and how to represent the relations between them (relation between objects).
The question is: What would be the best approach to extract the rich visual semantics of input text descriptions in order to describe/generate an image? In this paper, we introduce a new Visual Contextual Text Representation (VICTR: Visual Information Captured Text Representation), which represents the input text description with its visual semantics, as shown in Figure 1. The proposed model, VICTR, can be applied to diverse text-to-image multimodal tasks.
VICTR has five different modules: 1) Text to Scene Graph Parsing, 2) Scene Graph Embedding, 3) Positional Graph Embedding, 4) Visual Semantic Embedding, 5) Visual Contextual Text Representation. First, we extract scene graphs from input text descriptions in order to define what objects, attributes and relations should be in the image. The scene graph was initially proposed by Johnson et al. (2015) in order to represent the objects, their attributes and their relations in an image. Inspired by this, we generate scene graphs from the raw text description by using dependency parsing and transformer-based object-attribute-relation classification. Then, we train the extracted object, attribute and relation nodes via Graph Convolutional Networks (GCN) in order to generate the visual contextual text representation, a word-level representation which incorporates textual syntactic and visual semantic information. Finally, this representation is aggregated with the word-level and sentence-level embeddings respectively in order to generate a visual contextual word representation and a visual contextual sentence representation.
To evaluate the quantitative and qualitative aspects of our proposed model, we attach VICTR to the SOTA models in text-to-image generation. Thorough experiments on the COCO benchmark dataset demonstrate the superiority of VICTR with respect to both semantic consistency and visual reality. The main contributions are summarised at the end of Sec. 2.3.

Related Works and Contributions
We explore research trends in diverse text-to-vision multimodal tasks, which use text information as the only input to produce a visual output, including text to image generation and text to video generation.

Text to Image Generation
Since 2016, text to image generation tasks have been explored by applying conditional GANs (Reed et al., 2016; Zhang et al., 2017) with the text caption as input. AttnGAN (Xu et al., 2018) is the first method that utilized a word-level embedding and fused it with image vectors in an attention mechanism to identify the contributing words of sub-regions in the generated images. Extended from AttnGAN, MirrorGAN (Qiao et al., 2019) applied the same attention mechanism on both sentence and word embeddings to capture the global semantic consistency between generated images and input texts. SEGAN (Tan et al., 2019) proposed an adaptive attention mechanism on word-level embeddings to ensure that only words relevant to the generated images obtain attention weight. SD-GAN utilizes a Siamese structure to guarantee the semantic alignment between generated images and captions. DM-GAN (Zhu et al., 2019) proposed a dynamic memory network to fuse word embeddings and image representations for image generation. Most SOTA models applied bidirectional LSTM (RNN)-based text encoding, which contains only information that represents the order of the words and the words' importance in the given text caption. The output of the RNN-based text encoding is not enough to represent the rich visual semantics needed to directly generate the image; hence, the images from current SOTA models do not align well with the given text description.

Text to Video Generation
Similar to text to image tasks, most text to video generation models learn the caption via RNN cells as a conditional input representation to be used with Variational Autoencoders (VAE) or GANs in order to generate image frames. The main difference is that video generation considers how to model the temporal dependency by conditioning on the corresponding text captions. Sync-DRAW (Mittal et al., 2017) applies a recurrent attention-based VAE to create a temporally dependent sequence of frames, but it still applies an LSTM for input text encoding. GAN-based text to video generation approaches also apply RNN-based encoding to handle the input text captions. TGANs-C (Pan et al., 2017) utilises GANs with three discriminators to generate video based on input text captions encoded by a Bi-LSTM. IRC-GAN (Deng et al., 2019) proposed a Mutual-information Introspection (MI) that measures the semantic similarity between text and generated video through a two-stage process. The conditional text input is represented through a Bi-LSTM network. TFGAN (Balaji et al., 2019) generates discriminative convolutional filters from text features and then convolves them with image features in the discriminator. It applies a CNN (Convolutional Neural Network)-based text encoder, but it still does not represent sufficient visual semantics from text captions in order to generate the video.

Main contribution
Most existing models for text-to-image and text-to-video generation rely on an RNN- or CNN-based sentence feature from the raw text for modeling the cross-modal relation with the generated visual output. Hence, we now present our model, VICTR, an approach to extract the rich visual semantics of input text descriptions in order to describe or generate an image. The proposed VICTR is evaluated on text-to-image generation tasks. The models with VICTR outperform the original SOTA models in photo-realistic image generation based on text input. The major contributions of this work are summarised as follows: 1) The paper provides an example of capturing rich visual semantic and geometric relation information from raw text input. 2) The paper proposes a new visual information captured text representation for text-to-image generation tasks, which has not been reported before. The proposed text representation model can be used with any text-to-vision multimodal task.

Methodology
As shown in Figure 1, the proposed visual contextual text representation, VICTR (Visual Information Captured Text Representation), mainly focuses on capturing and representing the visual semantic information (i.e. location or aspect of the object in the image, positional relation between objects) from raw text descriptions. This is crucial for text-to-image generation tasks. In summary, the architecture of the proposed VICTR is composed of five modules: 1) Text to Scene Graph Parsing, 2) Scene Graph Embedding, 3) Positional Graph Embedding, 4) Visual Semantic Embedding, 5) Visual Contextual Text Representation. Note that the first module, Text to Scene Graph Parsing, can be considered a pre-processing step that prepares the input from which the rest of the architecture generates the visual information embedded text representation.

Text to Scene Graph Parsing
Based on the given raw text description, such as an image caption or scene description, we first extract a graph-based semantic representation, called a scene graph (Johnson et al., 2015), which explicitly represents object instances, their attributes, and the relations between objects. This simple graph representation describes visual scenes/images in great detail. Inspired by this idea, we generate scene graphs based on input text descriptions. In keeping with the nature of text-to-image generation tasks, we use only text descriptions in order to extract the scene graph with rich visual semantics (objects, attributes, relations of the image). In order to parse scene graphs from the given text caption, we first recognise the syntactic structure of the text descriptions by applying a universal dependency parser; in this research, we applied the Stanford enhanced dependency parser (Chen and Manning, 2014). However, the output of a dependency parser would not be enough to directly represent the number of objects (as well as their attributes and the relations between objects) that should be drawn in the scene graph. Hence, we add a semantic enhancement processing component, the quantity checker. The quantity checker aims to detect the number of objects that the scene graph needs to include. For example, the two text captions "two men are riding brown horses" and "two men are riding a brown horse" include different semantic information: the former would have two man objects and two brown horse objects, but the latter could contain two man objects and one brown horse object. We duplicate the individual nodes in the dependency graph according to the value of their quantificational modifiers. In addition, we also cover some quantificational determiners, such as "both of", "a dozen of", or "a lot of", by using a quantifier expression rule list.
From the syntactic and semantic integrated graph, we extract all nouns to classify into object classes, and retrieve all adjectives as attribute types of the specific object (pairwise classification). The relation between objects is detected if the word is the predicate or preposition of two different objects.
As a result, each text caption of an image can derive one scene graph G
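The node duplication performed by the quantity checker can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input format (pre-parsed noun phrases), the `QUANTIFIER_COUNTS` table, and the `expand_nodes` helper are all hypothetical, and a real pipeline would obtain the phrases from the dependency parser.

```python
# Hypothetical rule table mapping quantifier words to object counts.
# Only a few entries are shown; the real rule list is not given in the text.
QUANTIFIER_COUNTS = {"a": 1, "an": 1, "one": 1, "two": 2, "both": 2, "three": 3, "dozen": 12}


def expand_nodes(noun_phrases):
    """Duplicate each object node according to its quantifier (default: 1)."""
    nodes = []
    for phrase in noun_phrases:
        count = QUANTIFIER_COUNTS.get(phrase.get("quantifier"), 1)
        for index in range(1, count + 1):
            nodes.append({
                "id": f"{phrase['lemma']}_{index}",
                "object": phrase["lemma"],
                # Every duplicate keeps the attributes of the original noun.
                "attributes": list(phrase.get("attributes", [])),
            })
    return nodes


# Assumed parser output for "two men are riding a brown horse".
caption_phrases = [
    {"lemma": "man", "quantifier": "two"},
    {"lemma": "horse", "quantifier": "a", "attributes": ["brown"]},
]
nodes = expand_nodes(caption_phrases)
print([node["id"] for node in nodes])  # ['man_1', 'man_2', 'horse_1']
```

Under these assumptions the caption yields two man nodes and one brown horse node, matching the second example above.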

Scene Graph Embedding
We convert the extracted scene graph to a vectorised graph representation to produce useful feature representations of nodes and edges in the object-attributes-relation networks. We apply GCNs to model the relative nearness of nodes and edges in the scene graph and preserve the visual semantics.
The basic relational graph $G_b$ represents the visual semantic alignment between objects and relations, as well as between objects and attributes, in scene graphs. We train the graph using GCNs to produce scene graph embeddings. As shown in Figure 1, each object $o_i \in O$, relation $r_j \in R$, and attribute $a_t \in A$ becomes a node of the graph $G_b$. The object-to-relation and relation-to-object connections are then represented as edges $e_{o_i \to r_j}$ and $e_{r_j \to o_i}$ respectively. Similarly, the edge $e_{o_i \to a_t}$ indicates the object-attribute alignment. For the edges $e_{o_i \to r_j}$, $e_{r_j \to o_i}$, and $e_{o_i \to a_t}$, the weight is calculated based on the equations. The edge weight from a node to itself is 1. The edge weights are compiled into an adjacency matrix, combined with the graph degree matrix, and passed into a 2-layer GCN that is trained by mapping each object to its corresponding super-class. We denote the node embeddings for an object, an attribute and a relation as $EB_o, EB_a, EB_r \in \mathbb{R}^{B}$.
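A minimal sketch of this step is given below, under several assumptions that are not stated in the text: the off-diagonal edge weights are placeholders (0.5), the adjacency matrix is combined with the degree matrix through the common symmetric normalisation $D^{-1/2} A D^{-1/2}$, the hidden layer is read out as the node embedding, and the weights are random rather than trained. The function names are hypothetical.

```python
import numpy as np


def normalise_adjacency(adjacency):
    """Symmetric normalisation D^{-1/2} A D^{-1/2} (an assumed choice)."""
    degree = adjacency.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    return adjacency * inv_sqrt[:, None] * inv_sqrt[None, :]


def two_layer_gcn(adjacency, features, w1, w2):
    """Forward pass of a 2-layer GCN: hidden node embeddings and class logits."""
    a_hat = normalise_adjacency(adjacency)
    hidden = np.maximum(a_hat @ features @ w1, 0.0)  # ReLU
    logits = a_hat @ hidden @ w2                     # super-class scores per node
    return hidden, logits


# Toy scene graph for "man rides a brown horse":
# nodes 0=man (object), 1=horse (object), 2=ride (relation), 3=brown (attribute).
adjacency = np.eye(4)  # self-edges have weight 1, as stated in the text
for source, target in [(0, 2), (2, 1), (1, 3)]:
    adjacency[source, target] = 0.5  # placeholder weight, not the paper's formula
    adjacency[target, source] = 0.5

rng = np.random.default_rng(0)
embedding_dim, num_super_classes = 8, 3
features = np.eye(4)  # one-hot node features
w1 = rng.normal(size=(4, embedding_dim))
w2 = rng.normal(size=(embedding_dim, num_super_classes))

node_embeddings, class_logits = two_layer_gcn(adjacency, features, w1, w2)
print(node_embeddings.shape, class_logits.shape)  # (4, 8) (4, 3)
```

In training, the logits of the object nodes would be compared with their super-class labels, and the learned hidden rows would play the role of $EB_o$, $EB_a$ and $EB_r$.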

Positional Graph Embedding
In Section 3.2, the scene graph-based basic relational graphs mainly focus on the semantic relations (predicates or prepositions) between objects, e.g. "ride" from the text description "man rides a horse". This provides the lingual semantics of objects and relations, but it is still not enough to fully describe the image with geometric information, such as the location of objects or the relative position (e.g. left of) between objects, which includes an indicative and explicit location of one object in relation to another. Hence, we propose a position-enhanced relational graph $G_p$ that focuses on the visual semantics of relations between objects. Six relative geometric relations are chosen and denoted as $p \in \{$left of, right of, above, below, inside, surrounding$\}$ to represent the edges $e_{o_i \to r_j}$ and $e_{r_j \to o_i}$. To train these edges, the geometric relation is detected based on the gap between the bounding boxes of one object and another. Considering that one object may correspond to multiple geometric relations, we generate an individual graph for each type of geometric relation. The weights of the edges $e_{o_i \to r_j}$ and $e_{r_j \to o_i}$ in each of the six graphs $G_p$ are calculated in the same way as in the basic relational graph. For each graph, the edge weights are compiled into an adjacency matrix, combined with the graph degree matrix, and passed into a 2-layer GCN to train each object with its corresponding super-class. The object-level and relation-level node embeddings from the six position-enhanced relational graphs are concatenated to produce the positional object-level node embedding $EP_o \in \mathbb{R}^{P}$ and the relation-level node embedding $EP_r \in \mathbb{R}^{P}$.
Figure 2: Two-dimensional PCA projection of 1200-dimensional visual semantic vectors of objects and 500-dimensional position-enhanced vectors of relations. The figures illustrate the ability of the model to automatically organise concepts and implicitly learn the similarity between them, as during training we did not provide any supervised information but only the text descriptions of images in COCO2014.
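The detection of the six geometric relations from bounding boxes can be sketched as below. The exact rules are not given in the text, so the thresholds here are assumptions: boxes are `(x_min, y_min, x_max, y_max)` in image coordinates with y growing downwards, containment is tested first, and the directional relations require the boxes to be fully separated along an axis. The function name is hypothetical.

```python
def geometric_relations(box_a, box_b):
    """Return the relations of box_a with respect to box_b (possibly several)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    if ax0 >= bx0 and ay0 >= by0 and ax1 <= bx1 and ay1 <= by1:
        return ["inside"]
    if bx0 >= ax0 and by0 >= ay0 and bx1 <= ax1 and by1 <= ay1:
        return ["surrounding"]
    relations = []
    if ax1 <= bx0:
        relations.append("left of")
    if ax0 >= bx1:
        relations.append("right of")
    if ay1 <= by0:  # smaller y means higher in the image
        relations.append("above")
    if ay0 >= by1:
        relations.append("below")
    return relations


kite = (40, 10, 60, 30)
man = (45, 70, 55, 100)
beach = (0, 60, 100, 100)
dog = (0, 0, 10, 10)

print(geometric_relations(kite, man))                # ['above']
print(geometric_relations(man, beach))               # ['inside']
print(geometric_relations(beach, man))               # ['surrounding']
print(geometric_relations(dog, (20, 20, 30, 30)))    # ['left of', 'above']
```

The last call returns two relations for one object pair, which is the situation that motivates building a separate graph per relation type.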

Visual Semantic Embedding
We now integrate the object-, relation- and attribute-level embeddings from the basic relational graphs and the position-enhanced graphs in order to produce the comprehensive visual semantic embedding for scene graphs. The visual semantic embedding is composed of three aspects: 1) Object-level embedding $E_o \in \mathbb{R}^{B+P}$: concatenate $EB_o$ and $EP_o$ of object $o_i \in O$, 2) Relation-level embedding $E_r \in \mathbb{R}^{B+P}$: combine $EB_r$ and $EP_r$ for each relation $r_j \in R$, 3) Attribute-level embedding $E_a = EB_a \in \mathbb{R}^{B}$. For each object in one scene graph, the object embedding is concatenated with its attribute embedding as well as the corresponding relation embedding to produce the final visual semantic embedding $E_{vs} \in \mathbb{R}^{2(B+P)+B}$. Based on the final visual semantic embedding, we now visualise the ability of the proposed embedding model to automatically organise different aspects of objects and implicitly learn the relations between them. Figure 2 illustrates the visual semantic vectors of diverse objects and the position-enhanced relation vectors that appear frequently with those objects in COCO2014. In Figure 2(a), the animal objects cat and dog are close to each other while being far away from the electronics objects, such as mouse and TV. These electronics objects are close to the relation place because it is commonly used with them, unlike the relations sit or stand. Similarly, the relations park and dock are close to the group of vehicle objects truck, boat and train but far from the kitchenware object cluster in Figure 2(b). This pattern can also be found in Figure 2(c), where the food objects are gathered together.
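The dimension bookkeeping of the concatenation can be checked with a short sketch. The values $B = 200$ and $P = 300$ are an assumption, chosen only because they agree with the 1200-dimensional object vectors and 500-dimensional relation vectors mentioned in the Figure 2 caption ($2(B+P)+B = 1200$ and $B+P = 500$); the random vectors stand in for trained embeddings, and a single attribute and relation per object is assumed.

```python
import numpy as np

B, P = 200, 300  # assumed sizes, consistent with the Figure 2 caption
rng = np.random.default_rng(0)

# Stand-ins for the trained node embeddings of one object,
# one of its attributes and one of its relations.
eb_o, eb_a, eb_r = (rng.normal(size=B) for _ in range(3))
ep_o, ep_r = (rng.normal(size=P) for _ in range(2))

e_o = np.concatenate([eb_o, ep_o])  # object-level embedding, B + P
e_r = np.concatenate([eb_r, ep_r])  # relation-level embedding, B + P
e_a = eb_a                          # attribute-level embedding, B

# Final visual semantic embedding: object, attribute and relation parts.
e_vs = np.concatenate([e_o, e_a, e_r])
print(e_o.shape, e_r.shape, e_vs.shape)  # (500,) (500,) (1200,)
```

How multiple attributes or relations of the same object are aggregated before concatenation is not specified in the text and is left out of this sketch.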

Visual Contextual Text Representation
The proposed visual semantic embedding strongly integrates the semantic information of an object with its attributes and the relations attached to it, as well as the positional (geometric) relations between the object and others. In order to seamlessly ingrain this visual semantic information into the text representation, we integrate it into the word and sentence representations from the raw text using an attention mechanism. The attention mechanism is applied between $E_{vs}$ and the corresponding text word embedding $E_{word} \in \mathbb{R}^{L \times D}$, which is derived from the text encoder with sequence length $L$ and word embedding dimension $D$. The attention is inspired by Vaswani et al. (2017): we use $E_{vs}$ as both $K$ and $V$ while taking $E_{word}$ as $Q$. Here $W$ is a learnable weight that maps the word representation to the visual semantics space. The attended visual semantic embedding $E_{vs} \in \mathbb{R}^{L \times (2(B+P)+B)}$ represents the importance of each object-based visual semantic information to each word in the sequence of a text caption. The attended $E_{vs}$ is concatenated with the word embedding $E_{word}$ to get the visual contextual word representation $E_{VICTR-W}$. Similarly, the object information over all the words is summed up via the attended $E_{vs}$ and then concatenated with the sentence embedding $E_{sent} \in \mathbb{R}^{D}$ derived from the text encoder to get the visual contextual sentence representation $E_{VICTR-S}$.
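The attention and aggregation steps can be sketched as follows. This is an illustration under assumptions rather than the authors' formulation: the scaled dot-product form and the $1/\sqrt{d}$ scaling are borrowed from Vaswani et al. (2017), $W$ is taken to be a $D \times (2(B+P)+B)$ matrix, all tensors are random stand-ins, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def victr_representations(e_word, e_sent, e_vs, w):
    """Attend from words (Q) to object embeddings (K = V) and aggregate."""
    query = e_word @ w                                  # (L, S)
    scores = query @ e_vs.T / np.sqrt(e_vs.shape[1])    # (L, N), assumed scaling
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    attended = weights @ e_vs                           # (L, S)
    victr_w = np.concatenate([e_word, attended], axis=1)       # (L, D + S)
    victr_s = np.concatenate([e_sent, attended.sum(axis=0)])   # (D + S,)
    return victr_w, victr_s


L, D, N = 6, 8, 3          # words, word dimension, objects in the scene graph
B, P = 2, 3                # small sizes for illustration only
S = 2 * (B + P) + B        # visual semantic dimension

rng = np.random.default_rng(0)
e_word = rng.normal(size=(L, D))
e_sent = rng.normal(size=D)
e_vs = rng.normal(size=(N, S))
w = rng.normal(size=(D, S))

victr_w, victr_s = victr_representations(e_word, e_sent, e_vs, w)
print(victr_w.shape, victr_s.shape)  # (6, 20) (20,)
```

In an attention-based generator of the kind described in the evaluation, `victr_s` would condition the initial image generation and `victr_w` the word-level refinement stages.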

Evaluation Setup
Baselines Three text-to-image generation SOTA models, StackGAN (Zhang et al., 2017), AttnGAN (Xu et al., 2018), and DM-GAN (Zhu et al., 2019) were selected as baselines, which all used text representation as the only source for image generation. We replaced the original text representation with our proposed VICTR and compared the generated images.
Dataset We evaluated the model performance on COCO2014 (Lin et al., 2014), which is the most common benchmark and contains photo-realistic images with diverse objects and relations. Detailed dataset statistics are shown in Table 1. Each image has 5 corresponding image descriptions, and we selected the caption which generates the richest scene graph. We used bounding box features to train the geometric relations of multiple objects.
Evaluation Metrics We use Inception Score (IS), Fréchet Inception Distance (FID) and R-precision to quantitatively evaluate the model performance on 30,000 generated images. IS (Salimans et al., 2016) uses the Kullback-Leibler (KL) divergence to compare the label probability distribution of each generated image with the marginal probability distribution over all generated images; the higher the IS, the better the model is at generating diverse and distinct images. FID (Heusel et al., 2017) is an improved version of IS that computes the Fréchet distance between the maximum entropy (Gaussian) distributions fitted to the generated images and to the real images. The lower the FID, the more similar the generated images are to the real images. R-precision measures the consistency between the generated image and the input text. We followed Xu et al. (2018) and set R = 1: candidate captions are ranked by the cosine similarity between the generated image vector and the text embedding, the r relevant captions among the top R are counted, and R-precision is calculated as r/R. The final score is the average R-precision over all images; the higher the score, the better the consistency between generated images and captions.
Table 2 shows the IS, FID and R-precision of the SOTA models and the corresponding improvement with VICTR. Applying the VICTR-S feature in StackGAN improved the overall IS by around 1.93, which indicates a higher quality of the final generated images.
Specifically, the original StackGAN achieved 8.45 on IS with 600 epochs on stage-II, while the model with VICTR outperformed the original model at only 130 epochs. For AttnGAN and DM-GAN, we applied the VICTR-S feature in the initial image generation and the VICTR-W feature for the iterative refinement. There is a clear improvement in all three metrics for both models with VICTR. The improvement in FID shows that using visual semantic relations between objects actually helped to form groups of objects in positions geometrically similar to those in the ground truth images. Moreover, VICTR was mined from the original text and aligns the lingual semantics and visual semantics in the image captions. It helps the model to generate images which are better aligned with the text captions and leads to the increase in R-precision.
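To make the metric definitions concrete, the Inception Score and R-precision with R = 1 can be sketched as below. The inputs are assumed to be precomputed: class probabilities from an Inception classifier for IS, and image and caption vectors from a shared embedding space for R-precision. Using every caption in the batch as the candidate pool is a simplification of this sketch, not necessarily the protocol of Xu et al. (2018), and the function names are hypothetical.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """exp of the mean KL(p(y|x) || p(y)) over N images; probs has shape (N, C)."""
    marginal = probs.mean(axis=0, keepdims=True)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


def r_precision_at_1(image_vecs, caption_vecs):
    """Fraction of images whose most similar caption (cosine) is their own.

    image_vecs[i] is assumed to be generated from caption_vecs[i].
    """
    img = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    cap = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    similarity = img @ cap.T
    hits = similarity.argmax(axis=1) == np.arange(len(img))
    return float(hits.mean())


# Confident and diverse predictions give the maximum score (number of classes);
# identical uniform predictions give the minimum score of 1.
print(round(inception_score(np.eye(4)), 4))             # 4.0
print(round(inception_score(np.full((4, 4), 0.25)), 4)) # 1.0

# Perfectly aligned image and caption vectors give an R-precision of 1.
print(r_precision_at_1(np.eye(3), np.eye(3)))           # 1.0
```

FID is omitted here because it additionally requires a matrix square root of the product of the two feature covariance matrices.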

Visual Comparison
The visual comparisons between the three SOTA baselines and those with the proposed VICTR are presented in Figure 3. There are several findings from the visual comparison. Firstly, with VICTR, images show a clearer structure (appearance of objects and their relative positions) and are also closer to the ground truth than those generated by the original SOTA models. For example, the image in column 3 of Figure 3(a) generated by StackGAN with VICTR has a structure similar to the ground truth image: the objects beach, ocean, and sky are positioned from bottom to top, with a kite flying in the sky and a man standing on the beach. Secondly, compared to images from the original models, the VICTR-based images provide clearer object shapes, so the objects are relatively easier to recognise (e.g. food, plate, sheep, cat and human from columns 1 and 2 in Figure 3(a), columns 1 and 2 in Figure 3(b) and column 1 in Figure 3(c)). Moreover, VICTR helps the model to understand the contents of the text caption well: 1) more objects from the text are identified in the image (e.g. the object cat from column 2 in Figure 3(b) and the man/parasail from column 1 in Figure 3(c) are completely missing without VICTR). 2) VICTR is good at resolving quantifiers into individual objects. A flock of sheep is well captured by VICTR in column 1 of Figure 3(b), where the original model failed to identify the number of objects. 3) Even when the ground-truth image does not match the caption, the VICTR-based models can generate images that are consistent with the caption, as shown in column 3 of Figure 3(b) and column 2 of Figure 3(c).

Ablation Study - Cascaded Generators
Figure 4 indicates that the VICTR-based model is able to generate better initialised images and refine them to be more related to the given text caption. In the baseline models, the initial stage image generation with the sentence-level feature captures the major frame or very rough appearance of objects identified from the text, whereas the image refinement process only focuses on the word feature to polish the initial image but makes no major scene changes. From the images generated by the original models, we found that: (1) the sentence-level feature from the Bi-LSTM encoder at the initial stage is not enough to produce the precise main image structure described in the original text caption, so the models tend to make mistakes; and (2) the word feature in the following refinement process is not enough to amend these mistakes (from the initial stage), which limits the quality of the final images. For example, the 2nd caption in Figure 4(a) and 4(b) describes two kites sailing in the sky. However, both of the initial images generated by the original AttnGAN and DM-GAN capture only one kite in the sky, and this error propagates to the final image. In comparison, in the images generated with VICTR, two kites appear from the initial stage, which matches the caption well, and this persists all the way to the final image generation. A similar pattern can be found for the 1st caption in Figure 4(a), where the original AttnGAN fails to capture the positional relationship in the background between the objects street and buildings, as well as for the 1st caption in Figure 4(b), where the objects train and water are not drawn clearly and are not well positioned in relation to each other in the initial image, leading to the low quality of the final image.

Refinement Attention Inspection
We visualise the parsed scene graph, the intermediate images and the attention maps of each refinement stage in Figure 5. Several improvements can be observed in the word-image attention, which better reflects the visual-linguistic alignment of objects and their positions: 1) The model with VICTR can focus on the more relevant and important object region in the image while using the corresponding word feature for the refinement. For example, in Figure 5(a), the model with VICTR highlights the words a, couple and elephant to generate two elephants in the image, whereas the original models do not. A similar pattern can be found with the words flower and tree in Figure 5(b).
2) The positional relation attention represents a semantically meaningful visual context alignment on the linguistic relation expressed in the text description. This can be easily observed by the attention of words standing in from Figure 5(a).

Human Evaluation
We conducted a human evaluation with 50 participants to qualitatively evaluate VICTR in terms of the consistency between generated images and captions. The results and examples are in the Appendix.

Conclusion
In this paper, we proposed a new visual contextual text representation for text-to-image multimodal tasks, called VICTR, which extracts rich visual semantic information from input text descriptions. We have shown improvement in both quantitative and qualitative aspects when applying VICTR to diverse SOTA models in text-to-image generation. We also presented an analysis showing the ability of VICTR to automatically organise different aspects of objects and learn the relations between them. The human evaluation results show that VICTR produces images that are highly aligned with text captions and very realistic. It is hoped that VICTR provides insight into the future integration of text handling in text-to-vision multimodal tasks.