Unsupervised Keyword Extraction for Full-Sentence VQA

In the majority of the existing Visual Question Answering (VQA) research, the answers consist of short, often single words, as per instructions given to the annotators during dataset construction. This study envisions a VQA task for natural situations, where the answers are more likely to be sentences rather than single words. To bridge the gap between this natural VQA and existing VQA approaches, a novel unsupervised keyword extraction method is proposed. The method is based on the principle that the full-sentence answers can be decomposed into two parts: one that contains new information answering the question (i.e. keywords), and one that contains information already included in the question. Discriminative decoders were designed to achieve such decomposition, and the method was experimentally implemented on VQA datasets containing full-sentence answers. The results show that the proposed model can accurately extract the keywords without being given explicit annotations describing them.


Introduction
Visual recognition is one of the most actively researched fields, and this research is expected to be applied to real-world systems such as robots. Since innumerable object classes exist in the real world, training all of them in advance is impossible. Thus, to train image recognition models, it is important for real-world intelligent systems to actively acquire information. One promising approach to acquiring information on the fly is learning by asking, i.e., generating questions to humans about unknown objects and consequently learning new knowledge from the human responses (Misra et al., 2018; Uehara et al., 2018; Shen et al., 2019). This implies that if we can build a Visual Question Answering (VQA) system (Antol et al., 2015) that functions in the real world and extracts knowledge from human responses, we can realize an intelligent system that can learn autonomously.

Figure 1: Example of the proposed task: keyword extraction from full-sentence VQA. Given an image, the question, and the full-sentence answer, the keyword extraction model extracts a keyword from the full-sentence answer. In this example, the word "candles" is the most important part, answering the question "What is in front of the animal that looks white?". Therefore, "candles" is considered the keyword of the answer.
VQA is a well-known vision and language task which aims to develop a system that can answer a question about an image. One typical dataset used in VQA is the VQA v2 dataset (Goyal et al., 2017). The answers in the VQA v2 dataset are essentially single words. This is because the annotators are instructed to keep the answer as short as possible when constructing the dataset.
The ultimate goal of the present work is to gain knowledge through VQA that can be easily transferred to other tasks, such as object class recognition and object detection. Therefore, the knowledge (VQA answers) should be represented by a single word, such as a class label. However, in real-world dialog, answers are rarely expressed by single words; rather, they are often expressed as complete sentences. In fact, in VisDial v1.0 (Das et al., 2017), a dataset of natural conversations about images that does not have a word limit for answers, the average length of answers is 6.5 words. This is significantly longer than the average length of the answers in the VQA v2 dataset (1.2 words).
To bridge the gap between existing VQA research and real-world VQA, a challenging problem must be solved: identifying the word in the sentence that corresponds to the answer to the question. It must also be considered that full-sentence answers provided by humans are likely to follow a variety of sentence structures. Thus, traditional approaches, such as rule-based methods built on part-of-speech tagging or shallow parsing, require a great deal of work on defining rules in order to extract the keywords. Our key challenge is to propose a novel keyword extraction method that leverages information from images and questions as clues, without the heavy work of annotating keywords or defining rules. This work handles the task of extracting a keyword when a full-sentence answer is obtained from VQA (full-sentence VQA). The simplest approach to this task is to construct a dataset containing full-sentence answers and keyword annotations, and then train a model on this dataset in a supervised manner. However, the cost of constructing a VQA dataset with full-sentence answers and keyword annotations is very high. If a keyword extraction model can be trained on a dataset without keyword annotations, we can eliminate the high cost of collecting keyword annotations.
We propose an unsupervised keyword extraction model using a full-sentence VQA dataset which contains no keyword annotations. Here, the principle is based on the intuition that the keyword is the most informative word in the full-sentence answer, and contains the information that is not included in the question (i.e., the concise answer). Essentially, the full-sentence answer can be decomposed into two types of words: (1) the keyword information that is not included in the question, and (2) the information that is already included in the question. For example, in the answer "The egg shaped ghost candles are in front of the bear." to the question "What is in front of the animal that looks white?", the word "candles" is the keyword, while the remaining part "The egg shaped ghost something is in front of the bear" is either information already included in the question or additional information about the keyword. In this case, words like "egg," "ghost," and "bear" are also not in the question, making it difficult to find the keyword via naive methods, e.g., rule-based keyword extraction. Our proposed model utilizes image features and question features to calculate the importance score for each word in the full-sentence answer. Therefore, based on the contents of the image and the question, the model can accurately estimate which words in the full-sentence answer are important. To the best of our knowledge, this is the first attempt at extracting a keyword from full-sentence VQA in an unsupervised manner. The main contributions of this work are as follows: (1) We propose a novel task of extracting keywords from full-sentence VQA with no keyword annotations. (2) We designed a novel, unsupervised keyword extraction model by decomposing the full-sentence answer. (3) We conducted experiments on two VQA datasets, and provided both qualitative and quantitative results that show the effectiveness of our model.

Unsupervised Keyword Extraction for Text
Unsupervised keyword extraction methods can be broadly classified into two categories: graph-based methods and statistical methods. Graph-based methods construct graphs from target documents by using co-occurrence between words (Mihalcea and Tarau, 2004; Wan and Xiao). These methods are only applicable to documents with a certain length, as they require the words in the document to co-occur multiple times. The target document in this work is a full-sentence answer of VQA, whose average length is about 10 words. Therefore, graph-based methods are not suitable here.

Statistical methods, in contrast, compute the similarity between the candidate word (or phrase) embeddings and the sentence embeddings to retrieve the most representative word of the text.

Figure 2: Illustration of the key concept. In this example, the word "candles" is the keyword for the full-sentence answer, "The egg shaped ghost candles are in front of the bear." We consider the keyword extraction task as the decomposition of the full-sentence answer into answer information and question information. Therefore, if the keyword (i.e., the most informative word in the full-sentence answer) can be accurately extracted, the original full-sentence answer can be reconstructed from it. Additionally, the question can be reconstructed from the decomposed question information in the full-sentence answer.

Visual Question Answering
VQA is a well-known task that involves learning from image-related questions and answers. The most popular VQA dataset is VQA v2 (Goyal et al., 2017), and much research has used this dataset for performance evaluations. In VQA v2, the average number of words in an answer is only 1.2, and the variety of answers is relatively limited.
As stated in Section 1, in natural question answering by humans, the answers are more likely to be expressed as sentences rather than single words. Some datasets exist that contain full-sentence answers together with the corresponding single-word answers, which can serve as keyword annotations.
FSVQA (Shin et al., 2016) is a VQA dataset with answers in the form of full sentences. Its full-sentence answers are automatically generated by applying numerous rule-based natural language processing patterns to the questions and single-word answers in the VQA v1 dataset (Antol et al., 2015).
The recently proposed dataset, named GQA (Hudson and Manning, 2019), also contains automatically generated full-sentence answers. This dataset is constructed on the Visual Genome (Krishna et al., 2017), which has rich and complex annotations about images, including dense captions, questions, and scene graphs. The questions and answers (both single-word and full-sentence) in the GQA dataset are created from scene graph annotations of the images.
The full-sentence answers in both datasets described above are annotated automatically, i.e., not by humans. Therefore, neither dataset has both full-sentence answers and manually annotated keywords.

Attention
The attention mechanism is a technique originally proposed in machine translation (Bahdanau et al., 2015), aimed at focusing on the most important parts of the input sequence for a task. Since the method proposed herein utilizes an attention mechanism to calculate the importance score of each word in the full-sentence answer, some prior works on attention mechanisms are discussed.
In general, an attention mechanism essentially learns the mapping between a query and key-value pairs. Transformer (Vaswani et al., 2017) is one of the most popular attention mechanisms for machine translation. It enables machine translation without using recurrent neural networks, using a self-attention mechanism and feed-forward networks instead.
Another study uses an attention mechanism for weakly supervised keyword extraction (Wu et al., 2018). They first trained a model for document classification and extracted the word to which the model pays "attention" to perform the classification. This system requires additional annotations of document class labels to train the model, whereas we aim to extract keywords without any additional annotations.
Proposed Method

Our model extracts two feature vectors from the full-sentence answer, each representing the keyword information and the information derived from the question, respectively. To ensure that these two features discriminatively include keyword information and question information, we intend to reconstruct the original questions and answers from the question features and keyword features, respectively. Thus, if we successfully extract the keyword and the question information from the full-sentence answer, we can reconstruct the original full-sentence answer and the question. Essentially, given an image, its corresponding question, and a full-sentence answer, our proposed model extracts the keyword of the answer by decomposing the keyword information and the question information in the answer.

Overview
An overview of the model is shown in Figure 3. To realize decomposition-based keyword extraction, we designed a model which consists of the encoder $E$, the attention scoring modules $S_a$ and $S_q$, and the decoder modules $D_{all}$, $D_a$, and $D_q$.
An image $I$, the corresponding question $Q$, and the full-sentence answer $A = \{w^{(a)}_1, \dots, w^{(a)}_n\}$ are considered as the model input. Here, $w^{(a)}_i$ represents the $i$-th word in the full-sentence answer.
Given $I$ and $Q$, $E$ extracts image and question features and integrates them into a joint feature $f_j$, i.e., $E(I, Q) = f_j$.
Next, $S_a$ and $S_q$ take $f_j$ and $A$ as input and output the weight vectors $a_k = \{a^{(k)}_1, \dots, a^{(k)}_n\}$ and $a_q = \{a^{(q)}_1, \dots, a^{(q)}_n\}$, which contain one weight for each word in $A$. We denote $a_i \in (0, 1)$ as the weight score of the $i$-th word in $A$.
Then, we consider the keyword vector $f_k$ as the embedding vector of the word with the highest weight score in $a_k$. Meanwhile, the question information vector $f_q$ is considered as the sum of the embedding vectors of $A$ weighted by $a_q$.
Following this, $D_{all}$ uses an LSTM to reconstruct the original full-sentence answer from $f_q$ and $f_k$. $f_q$ and $f_k$ are intended to represent the question information and the keyword information of the full-sentence answer, respectively. However, $D_{all}$ only ensures that the two features jointly contain the information of the full-sentence answer. To separate them, we designed the additional decoders $D_a$ and $D_q$. The former reconstructs the BoW features of the answer using $f_k$, while the latter reconstructs those of the question using $f_q$, each with an auxiliary vector. The objective of this operation is to make $f_k$ and $f_q$ representative features for the full-sentence answer and the question, respectively.
The entire model is trained to minimize the disparity between the reconstructed sentences $A_{recon}$ and the original full-sentence answers, as well as that between the predicted and ground-truth BoW features of the full-sentence answers and the questions.

Encoder
The module $E$ encodes the image $I$ and the question $Q$ and obtains the image feature $f_I$, the question feature $f_Q$, and the joint feature $f_j$. To generate $f_I$, we use the image feature extracted from a deep CNN pre-trained on a large-scale image recognition dataset. For $f_Q$, each word token is converted into a word embedding, and the embeddings are averaged. Following this, $l_2$ normalization is performed on both features. Finally, the features are concatenated into the joint feature $f_j = [f_I; f_Q] \in \mathbb{R}^{d_j}$, where $d_j$ is the dimension of the joint feature and $[;]$ indicates concatenation. Note that we did not update the model parameters of $E$ during training.
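As a concrete illustration (our own sketch, not the authors' code), the encoder step above can be written in a few lines of NumPy; the function names `l2_normalize` and `encode` are ours:

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # l2 normalization, guarded against zero vectors
    return v / (np.linalg.norm(v) + eps)

def encode(image_feat, question_word_embs):
    # image_feat: pooled CNN feature (e.g., 2048-d from ResNet)
    # question_word_embs: (n_words, d_e) word embeddings of the question
    f_I = l2_normalize(image_feat)
    f_Q = l2_normalize(question_word_embs.mean(axis=0))
    return np.concatenate([f_I, f_Q])  # joint feature f_j = [f_I; f_Q]
```

Note that both halves of $f_j$ have unit norm, so neither modality dominates the concatenated feature.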

Attention Scoring Module
This module takes $f_j$ as input and weights each word in the full-sentence answer. We use two of these modules, $S_a$ and $S_q$, which compute the weights based on the importance of a word for the full-sentence answer and for the question, respectively. Since $S_a$ and $S_q$ have a nearly identical structure, the details of $S_a$ are presented first, followed by a description of the difference between the two.
The weight scoring in these modules is based on the attention mechanism used in Transformer (Vaswani et al., 2017).
First, each word in the full-sentence answer is encoded to form the full-sentence answer matrix $f_A = \{w^{(a)}_1, \dots, w^{(a)}_n\} \in \mathbb{R}^{d_e \times n}$. Here, $w^{(a)}_i$ denotes the embedding vector of the $i$-th word, $n$ is the length of the full-sentence answer, and $d_e$ is the dimension of the word embedding vector. To represent the word order, positional encoding is applied to $f_A$: before feeding $f_A$ into the scoring modules, we add positional embedding vectors to $f_A$, similar to Transformer (Vaswani et al., 2017).

We describe our attention mechanism as a mapping between a Query and Key-Value pairs. First, we calculate the Query vector $Q \in \mathbb{R}^h$, the Key matrix $K \in \mathbb{R}^{h \times n}$, and the Value matrix $V \in \mathbb{R}^{h \times n}$. The attention weight vector $a_k$, whose $i$-th element $a^{(k)}_i$ is the weight score of the $i$-th word, is computed from the product of $Q$ and $K$:

$a_k = \mathrm{softmax}(Q^\top K / \sqrt{h})$

Then, the word with the highest weight score is chosen as the keyword of the full-sentence answer:

$\mathrm{keyword} = w^{(a)}_{\hat{i}}, \quad \hat{i} = \mathop{\mathrm{argmax}}_i a^{(k)}_i$

However, the argmax operation is non-differentiable. Therefore, we approximate it by a softmax with temperature.
$f_k = \sum_{i=1}^{n} \mathrm{softmax}(a_k / \tau)_i \, w^{(a)}_i$

where $\tau$ is a temperature parameter; as $\tau$ approaches 0, the output of the softmax function approaches a one-hot distribution. $S_q$ has the same structure as $S_a$ up to the point of computing the attention weight vector $a_q$. For the keyword vector, we intend to focus on one specific word in the full-sentence answer, and therefore use the softmax with temperature. For the question vector, however, there is no need to focus on a single word, so it is calculated as the sum of the embedding vectors weighted by the attention scores:

$f_q = \sum_{i=1}^{n} a^{(q)}_i w^{(a)}_i$

Finally, a single-layer feed-forward neural network followed by layer normalization (Ba et al., 2016) is applied to the outputs $f_k$ and $f_q$.
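The scoring step can be sketched as follows (our own minimal NumPy illustration, assuming single-layer projections `W_q` and `W_k`, which are hypothetical stand-ins for the module's learned parameters). A low-temperature softmax over the scores behaves like a differentiable argmax, so `f_k` approaches the embedding of the top-scoring word:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score_answer_words(f_j, word_embs, W_q, W_k, tau=0.01):
    # Query from the joint image-question feature, Keys from answer word embeddings
    Q = W_q @ f_j                           # (h,)
    K = W_k @ word_embs.T                   # (h, n)
    scores = (Q @ K) / np.sqrt(Q.shape[0])  # scaled dot-product, (n,)
    a = softmax(scores)                     # attention weights
    a_sharp = softmax(scores / tau)         # low temperature ~ one-hot argmax
    f_k = word_embs.T @ a_sharp             # ~ embedding of the top-scoring word
    f_q = word_embs.T @ a                   # plain weighted sum of embeddings
    return a, a_sharp, f_k, f_q
```

With `tau=0.01`, `a_sharp` is nearly one-hot, which mimics the hard keyword selection while remaining differentiable.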

Decoder
Entire Decoder In the entire decoder $D_{all}$, the full-sentence answer is reconstructed from the outputs of the attention scoring modules, $f_k$ and $f_q$, i.e., $A_{recon} = D_{all}(f_k, f_q)$, where $A_{recon}$ denotes the reconstructed full-sentence answer. We use an LSTM as the sentence generator. As the input to the LSTM at each step, $f_k$ and $f_q$ are concatenated to the output of the previous step:

$x_t = W_x [\hat{s}_{t-1}; f_k; f_q]$

where $\hat{s}_{t-1}$ is the output of the LSTM at step $t-1$, and $W_{x_0}$ (applied at the initial step) and $W_x$ are learned parameters.
The objective of $D_{all}$ is defined by the cross-entropy loss:

$\mathcal{L}_{all} = -\sum_{t} \log p(s^{(ans)}_t \mid s^{(ans)}_{<t}, f_k, f_q)$

where $s^{(ans)}$ is the ground-truth full-sentence answer.
Further, word dropout (Bowman et al., 2016), a method of masking input words with a specific probability, is applied. This forces the decoder to generate sentences based on $f_k$ and $f_q$ rather than relying on the previous word.
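Word dropout itself is simple to implement; the sketch below (our own illustration, with a hypothetical `<unk>` mask token) replaces each input token independently with the given probability:

```python
import random

def word_dropout(tokens, p, rng=None):
    # Replace each input token with "<unk>" with probability p, so the decoder
    # cannot simply copy the previous word and must rely on f_k and f_q.
    rng = rng or random.Random(0)
    return [t if rng.random() >= p else "<unk>" for t in tokens]
```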
Discriminative Decoders $D_{all}$ attempts to reconstruct the full-sentence answer from $f_k$ and $f_q$; thus, it only ensures that the pair of feature vectors contains the answer information. However, the keyword and question information are intended to be represented by $f_k$ and $f_q$ separately. Therefore, we designed the discriminative decoders $D_a$ and $D_q$, which decode $f_k$ and $f_q$, respectively, so that each vector captures the desired information. $D_a$ and $D_q$ reconstruct the full-sentence answer and the question, respectively. This reconstruction targets the BoW features of the sentence rather than the sentence itself, because we intend to focus on the content of the sentence and not its sequential information. Sentence reconstruction was also considered as an alternative, but it is difficult to train using an LSTM. The BoW feature $b \in \mathbb{R}^{n_s}$ is a vector whose $i$-th element is $N_i / L_s$, where $n_s$ is the vocabulary size, $N_i$ is the number of occurrences of the $i$-th word, and $L_s$ is the number of words in the sentence.
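The BoW feature defined above can be computed directly from token counts; a minimal sketch (our own illustration, not the authors' code):

```python
from collections import Counter

def bow_feature(tokens, vocab):
    # b[i] = N_i / L_s: occurrence count of the i-th vocabulary word
    # divided by the sentence length
    counts = Counter(tokens)
    return [counts[w] / len(tokens) for w in vocab]
```

For example, for the tokens `["the", "cat", "the"]` over the vocabulary `["the", "cat", "dog"]`, the feature is `[2/3, 1/3, 0]`.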
The input to these discriminative decoders consists not only of the feature vectors but also of auxiliary vectors, additional features that assist the reconstruction. Specifically, the auxiliary vector for $D_a$ is the average of the word embedding vectors of the question, $f_Q$, and that for $D_q$ is the image feature $f_I$. We build each decoder as fully-connected layers:

$\hat{b}^{(a)} = D_a([f_k; f_Q]), \quad \hat{b}^{(q)} = D_q([f_q; f_I])$

The loss function for the discriminative decoders is the cross-entropy loss between the ground-truth BoW features and the predicted BoW features:

$\mathcal{L}_a = -\sum_{i=1}^{n_a} b^{(a)}_i \log \hat{b}^{(a)}_i, \quad \mathcal{L}_q = -\sum_{i=1}^{n_q} b^{(q)}_i \log \hat{b}^{(q)}_i$

where $b$ denotes the ground-truth BoW features, and $n_a$ and $n_q$ are the vocabulary sizes of the answer and the question, respectively.
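A single-layer version of such a decoder and its loss can be sketched as follows (our own simplified illustration: one linear layer with a hypothetical weight matrix `W`, followed by a softmax over the vocabulary, rather than the paper's exact architecture):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bow_decoder_loss(feature, aux, W, b_true, eps=1e-9):
    # One fully-connected layer maps [feature; aux] to a vocabulary
    # distribution; the loss is the cross-entropy against the
    # ground-truth BoW feature b_true.
    b_pred = softmax(W @ np.concatenate([feature, aux]))
    return float(-np.sum(b_true * np.log(b_pred + eps)))
```

With zero weights the predicted distribution is uniform, so the loss reduces to $\log n_s$ for a one-hot target, which is a handy sanity check.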

Full Objectives
Finally, the overall objective function for the proposed model is written as

$\mathcal{L} = \lambda_{all} \mathcal{L}_{all} + \lambda_a \mathcal{L}_a + \lambda_q \mathcal{L}_q$

where $\lambda_{all}$, $\lambda_a$, and $\lambda_q$ are hyper-parameters that balance the loss functions.

Implementation Details
In the encoder $E$, image features of size $2048 \times 14 \times 14$ were extracted from the pool-5 layer of ResNet-152 (He et al., 2016), pre-trained on ImageNet, and global pooling was applied to obtain 2048-dimensional features. To encode the question words, we used 300-dimensional GloVe embeddings (Pennington et al., 2014), pre-trained on the Wikipedia/Gigaword corpus.
To convert each word in the full-sentence answer into $f_A$, the embedding matrix in the attention scoring module was initialized with the pre-trained GloVe embeddings. The temperature parameter $\tau$ is gradually annealed using the schedule $\tau_i = \max(\tau_0 e^{-ri}, \tau_{min})$, where $i$ is the overall training iteration; the other parameters are set as $\tau_0 = 0.5$, $r = 3.0 \times 10^{-5}$, and $\tau_{min} = 0.1$. The LSTM in $D_{all}$ has a 1024-dimensional hidden state. The word dropout rate was set to 0.25.
We used the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of $1.0 \times 10^{-3}$ to train the model. We conducted experiments on two datasets, GQA and FSVQA; Table 1 presents the basic statistics of both.

GQA GQA (Hudson and Manning, 2019) contains 22M questions and answers. The questions and answers are automatically generated from image scene graphs, and the answers include both single-word answers and full-sentence answers. Because the questions and answers in GQA have unbalanced answer distributions, we used the balanced version of this dataset, which is down-sampled from the original and contains 1.7M questions. As pre-processing, we removed periods, commas, and question marks.

FSVQA FSVQA (Shin et al., 2016) contains 370K questions and full-sentence answers. This dataset was built by applying rule-based processing to the VQA v1 dataset (Antol et al., 2015) and to captions in the MSCOCO dataset (Lin et al., 2014) to obtain the full-sentence answers. There are ten annotations (i.e., single-word answers) per question in the VQA v1 dataset. Of these, the annotation with the highest frequency is chosen to create the full-sentence answer; if all frequencies are equal, an annotation is chosen at random. Since the authors do not provide the mapping between single-word answers and full-sentence answers, we considered the annotation with the highest frequency as the single-word answer matching the full-sentence answer, and filtered out questions for which the highest-frequency annotation cannot be determined. Following this process, we obtained 139,038 questions for the training set and 68,265 questions for the validation set.

Settings
The model performance was evaluated based on the keyword accuracy and the Mean Rank. Mean Rank is the average rank of the correct keyword when the words of each answer are sorted in order of importance score:

$\mathrm{Mean\ Rank} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rank}_i$

Here, $\mathrm{rank}_i$ is the rank of the keyword when the words in the $i$-th answer sentence are arranged in order of importance (TF-IDF score or attention score $a^{(k)}_i$), and $N$ is the total number of samples.
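The metric can be computed as follows (our own sketch; `score_lists` holds per-word importance scores and `keyword_indices` the position of the ground-truth keyword in each answer):

```python
def mean_rank(score_lists, keyword_indices):
    # Rank of the ground-truth keyword when the words of each answer are
    # sorted by descending importance score (rank 1 = best).
    ranks = []
    for scores, kw in zip(score_lists, keyword_indices):
        order = sorted(range(len(scores)), key=lambda j: -scores[j])
        ranks.append(order.index(kw) + 1)
    return sum(ranks) / len(ranks)
```

A perfect extractor always ranks the keyword first and achieves a Mean Rank of 1.0.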
We ran experiments with various existing unsupervised keyword extraction methods for comparison: (1) TF-IDF (Ramos, 2003), (2) YAKE (Campos et al., 2020), and (3) EmbedRank (Bennani-Smires et al., 2018). Since YAKE removes words with fewer than three characters as pre-processing, its Mean Rank cannot be calculated under the same conditions as the other methods and is therefore not reported. We also conducted an ablation study to show the importance of $D_a$ and $D_q$. In addition, we changed the reconstruction method from BoW estimation to original-sentence generation using an LSTM.

Experimental Results
The experimental results are shown in Table 2; accuracy per question type is provided in Appendix A for further analysis. The proposed model, which uses BoW estimation in $D_a$ and $D_q$, achieves superior performance on almost all metrics and datasets, except for the Mean Rank on FSVQA. As the ablation study shows, superior performance is achieved even without $D_a$ and $D_q$, which demonstrates the effectiveness of the proposed reconstruction-based method. When using an LSTM in $D_a$ and $D_q$, the accuracy and Mean Rank worsen compared to the proposed model, which reconstructs BoW features in these modules. This is likely because sentence reconstruction with an LSTM requires modeling the sequential information of the sentence, which is more complex than BoW estimation; since we intend to focus on the contents of the sentence, BoW is more suitable for these modules. We provide some examples in Figure 4. The examples on the left and right are from GQA and FSVQA, respectively. Since statistical methods such as TF-IDF tend to choose rarer words as keywords, they are likely to fail when the keyword is a common word (Figure 4 (a), (c)). In contrast, the proposed model can accurately extract keywords even in such cases.

Conclusion
In this paper, we proposed the novel task of unsupervised keyword extraction from full-sentence VQA. A novel model was designed to handle this task based on information decomposition of full-sentence answers and the reconstruction of questions and answers. Both qualitative and quantitative experiments show that our model successfully extracts the keyword of the full-sentence answer with no keyword supervision.
In future work, the extracted keywords will be utilized in other tasks, such as VQA, object classification, or object detection. This work could also be combined with recent works on VQG (Uehara et al., 2018; Shen et al., 2019), in which the system generates questions to acquire information from humans. However, these works assume that the answers are obtained as single words, which poses a problem when applying them to real-world question answering. By combining these studies with our research, an intelligent system can ask humans about unseen objects and learn new knowledge from the answers, even if an answer consists of more than a single word.