Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

Text in images carries essential information for understanding the scene and performing reasoning. The text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary; the positional information of the text is underused and there is a lack of evidence for the generated answer. To address this challenge, this paper proposes a localization-aware answer prediction network (LaAP-Net). Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.


Introduction
Visual Question Answering (VQA) has attracted much interest from the community and witnessed tremendous progress. However, the inability to generate answers based on text in the image limits its applications. Recently, new datasets (Biten et al., 2019a) and new methods (Hu et al., 2020) have been proposed to tackle this challenge, which is referred to as text VQA.
The earliest method for text VQA is LoRRA, which provides an optical character recognition (OCR) module for the VQA input and proposes a dynamic copy mechanism to select the answer from either a fixed vocabulary or the OCR words. The follow-up work M4C (Hu et al., 2020), inspired by LoRRA, uses rich OCR representations as input and a dynamic pointer network to deal with out-of-vocabulary answers, leading to state-of-the-art performance. However, M4C simply concatenates all modalities as transformer input and does not consider the high-level interaction among the modalities of text VQA. Moreover, it is unable to provide evidence for the answer since the text is not localized in the image. Another recent work (Wang et al., 2020) proposes a new dataset for evidence-based text VQA, which suggests an Intersection over Union (IoU) based evaluation metric to measure the evidence. Our work follows the spirit of evidence-based text VQA. More specifically, we generate the bounding box of the answer text during the answer prediction process as supplementary evidence for our answer. We propose a localization-aware answer prediction module (LaAP) that integrates the predicted bounding box with our semantic representation for the final answer. Besides, we propose a multimodal fusion module with a context-enriched OCR representation, which uses a novel position-guided attention to integrate context object features into the OCR representation.
The contributions of this paper are summarized as follows: 1) We propose a LaAP module, which predicts the OCR position and integrates it with the generated answer embedding for the final answer prediction. 2) We propose a context-enriched OCR representation (COR), which enhances the OCR modality and simplifies the multimodal input. 3) We show that the predicted bounding box provides evidence for analyzing network behavior in addition to improving performance. 4) Our proposed LaAP-Net outperforms state-of-the-art approaches on three benchmark text VQA datasets, TextVQA, ST-VQA (Biten et al., 2019b) and OCR-VQA (Mishra et al., 2019), by a noticeable margin.

Text Visual Question Answering
Text VQA has attracted much attention from the community. The predominant method is LoRRA, which takes image features, OCR features and the question to generate the answer. LoRRA mimics the human answering process with an image-looking module, a text-reading module and an answer-reasoning module. The generated answer can be selected from a fixed answer vocabulary or from one of the OCR tokens via the copy module. The copy module is further improved by M4C (Hu et al., 2020) using a dynamic pointer network. M4C also proposes a transformer-based network with three multimodal inputs (question, image object features and OCR features). We share the same spirit as M4C but split the network into a clear encoder-decoder structure. We further propose a context-enriched OCR representation to extract OCR-related image features.

Evidence-based VQA and Multitask Learning
Evidence-based VQA has been proposed in recent work (Wang et al., 2020), which suggests using intersection over union (IoU) to indicate the evidence. Many existing works (Selvaraju et al., 2017; Goyal et al., 2016; Gao et al., 2019) compute attention scores and build spatial maps over the image to highlight the regions the model focuses on. These spatial maps serve as evidence and visual explanations for a VQA architecture. Our method extends this idea by designing a location predictor that generates a bounding box on the image to explain the generated answer. The bounding box indicates that a correct answer is produced through underlying reasoning rather than by exploiting the statistics of the dataset. As such, the bounding box becomes evidence of the VQA answer. To achieve this, we design a multitask learning process, which not only generates the answer based on the image and question but also provides the bounding box for the answer. The proposed method improves the interpretability of VQA results and leads to better performance.

LaAP Network Architecture
To better utilize the position information of image texts and to encourage the network to better exploit visual features, we propose a localization-aware answer prediction network (LaAP-Net). Our LaAP-Net is built on a multimodal transformer encoder, a transformer decoder and a localization-aware prediction network, as shown in Figure 1. The transformer encoder takes the question embedding and the OCR embedding as input. The question embedding is generated by passing the question through a pretrained BERT-based model, whereas the OCR embedding is generated by our proposed context-enriched OCR representation module. As highlighted in dark yellow in Figure 1, the decoding process starts with the <begin> signal. For each decoded output, we first generate a bounding box. This bounding box is then embedded and added to the current answer decoder output, which is referred to as the localization-aware answer representation. Finally, it is fed to the vocabulary score module and the OCR score module. The scores are concatenated and the element with the maximum score is selected as the final answer. In the following sections, we present the three components of LaAP-Net: the context-enriched OCR representation, the localization-aware predictor and the transformer with simplified decoder.
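The per-step decoding loop described above can be sketched as follows. This is a minimal NumPy sketch under our own naming; the callables (`predict_box`, `embed_box`, `vocab_scores`, `ocr_scores`) stand in for the learned modules and are illustrative assumptions, not the paper's actual interfaces.

```python
import numpy as np

def decode_step(y_dec, ocr_enc, ocr_boxes, predict_box, embed_box,
                vocab_scores, ocr_scores):
    """One decoding step of the LaAP answer predictor (schematic).

    y_dec:     (d,)   decoder output for the current step
    ocr_enc:   (M, d) OCR encodings from the last encoder layer
    ocr_boxes: (M, 4) OCR bounding boxes
    """
    b_pred = predict_box(y_dec)             # predict the answer bounding box
    z_ans = y_dec + embed_box(b_pred)       # localization-aware answer repr.
    s_voc = vocab_scores(z_ans)             # (V,) fixed-vocabulary scores
    z_ocr = ocr_enc + embed_box(ocr_boxes)  # localization-aware OCR repr.
    s_ocr = ocr_scores(z_ans, z_ocr)        # (M,) OCR copy scores
    scores = np.concatenate([s_voc, s_ocr])
    # index < V selects a vocabulary word, index >= V selects an OCR token
    return int(np.argmax(scores)), b_pred
```

The concatenation makes vocabulary words and OCR tokens compete in a single argmax, which is how the final answer source is chosen at each step.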

Context-enriched OCR Representation
Existing work (Hu et al., 2020) builds a common embedding space for all modalities. However, this common embedding space has difficulty utilizing the image object features. We observe this by training the M4C (Hu et al., 2020) network without the image object modality: the accuracy is almost unaffected. To better exploit the image object modality, we propose the context-enriched OCR representation (COR) module (shown in Figure 2). Ideally, the answer for text VQA should be found among the OCR tokens, thus we integrate the geometric context objects of an OCR token into its representation to improve its discriminative power. We embed the given question into a set of word embeddings $x^{ques}_k$ (where $k = 1, ..., K$ and $K$ is the number of words) through a pretrained BERT language model (Devlin et al., 2019). All embeddings are then linearly projected to a $d$-dimensional space.

Figure 1: An overview of the LaAP model. We perform context-enriched OCR representation to extract object features. Then question words and enriched OCR tokens are input to the transformer encoder and the transformer decoder. Based on the transformer decoder outputs, we first predict the answer localization, and then integrate this localization into the OCR embedding. The decoder output is also equipped with the OCR position embedding. The OCR scores and vocabulary scores are calculated accordingly to find the answer from an OCR token or a word from the fixed answer vocabulary.
The detailed computation process for COR is described as follows. Firstly, the position-guided attention score vector $att_m$ between the $m$-th OCR token and the $N$ image objects is calculated as
$$att_m = \mathrm{softmax}\left(\frac{(W_Q b^{ocr}_m)(W_K B^{obj})^\top}{\sqrt{d}}\right),$$
where $W_Q$ and $W_K$ are the query projection matrix and key projection matrix respectively, $b^{ocr}_m$ is the bounding box of the $m$-th OCR token and $B^{obj}$ stacks the $N$ object bounding boxes. Then the $m$-th image-attended OCR representation is calculated as the weighted sum of the $N$ object feature vectors:
$$\hat{x}^{obj}_m = \sum_{n=1}^{N} att_{m,n}\, x^{obj}_n.$$
Note that we omit the multi-head attention mechanism (Vaswani et al., 2017) for simplicity. Finally, each OCR token is represented by aggregating the OCR feature embedding, the image-attended OCR representation and the position embedding:
$$\tilde{x}^{ocr}_m = x^{ocr}_m + \hat{x}^{obj}_m + W^{ocr} b^{ocr}_m,$$
where $W^{ocr}$ is a matrix that linearly projects the bounding box coordinate vector to $d$ dimensions. With the proposed attention, the image object modality is merged into the OCR modality. We then feed $\tilde{x}^{ocr}_1, ..., \tilde{x}^{ocr}_M$ and $x^{ques}_1, ..., x^{ques}_K$ into the transformer encoder as input. The strengthened OCR representation $\tilde{x}^{ocr}_m$ empowers the network to better learn the semantic correlation between OCR tokens and the question. Meanwhile, it simplifies the multimodal feature input, which improves the localization-aware answer prediction.
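The COR computation can be sketched as follows. This is a single-head NumPy sketch with toy weights; the assumption that queries and keys are derived from bounding boxes follows the "position-guided" description, and all names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_enriched_ocr(x_ocr, b_ocr, x_obj, b_obj, W_Q, W_K, W_pos):
    """Sketch of the COR module (single head, no multi-head attention).

    x_ocr: (M, d) OCR feature embeddings   b_ocr: (M, 4) OCR boxes
    x_obj: (N, d) object features          b_obj: (N, 4) object boxes
    W_Q, W_K, W_pos: (4, d) projections of box coordinates
    """
    d = x_ocr.shape[1]
    q = b_ocr @ W_Q                        # queries from OCR positions
    k = b_obj @ W_K                        # keys from object positions
    att = softmax(q @ k.T / np.sqrt(d))    # (M, N) position-guided attention
    x_att = att @ x_obj                    # image-attended OCR representation
    return x_ocr + x_att + b_ocr @ W_pos   # enriched OCR representation
```

Because the attention weights are computed from positions but aggregate object *features*, nearby objects contribute visual context to each OCR token.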

Localization-aware Predictor
To exploit the positional information of image features and texts, we design a localization-aware predictor to perform the bounding box prediction. The bounding box is embedded and added to the decoder output to generate the localization-aware answer representation. More specifically, given the answer embedding $y^{dec}$ output from the decoder, we calculate the localization-aware answer representation $z^{ans}$ by fusing $y^{dec}$ with the gated bounding box projection:
$$z^{ans} = y^{dec} + g^{loc} \circ (W^{loc} b^{pred} + bias^{loc}),$$
where $W^{loc}$ and $bias^{loc}$ are the weights of a linear layer that projects the predicted bounding box $b^{pred}$ to the same dimension as $y^{dec}$, and $\circ$ represents element-wise multiplication. $g^{loc}$ is the localization gate. Note that our network updates the gate weights automatically through training, so that the gate implicitly reveals the statistical importance of the localization information. Similarly, we calculate the high-level localization-aware representation $z^{ocr}_m$ (where $m = 1, ..., M$) of each OCR token as
$$z^{ocr}_m = y^{ocr}_m + g^{loc} \circ (W^{loc} b^{ocr}_m + bias^{loc}),$$
where $y^{ocr}_m$ denotes the $m$-th OCR encoding from the last encoder layer and $b^{ocr}_m$ is the corresponding bounding box coordinates. $b^{ocr}_m$ goes through the same linear projection layer and localization gate as $b^{pred}$, so that both are projected to the same high-dimensional space. Then, similar to (Hu et al., 2020), we obtain the similarity score $s^{ocr}_m$ between each OCR representation and the answer representation as
$$s^{ocr}_m = (W^{ans} z^{ans} + bias^{ans})^\top (W^{ocr} z^{ocr}_m + bias^{ocr}), \quad m = 1, ..., M,$$
where $W^{ans}$, $bias^{ans}$, $W^{ocr}$ and $bias^{ocr}$ are parameters of linear projection layers. The localization-aware answer representation $z^{ans}$ is also fed into a classifier to output $V$ scores $s^{voc}_v$ ($v = 1, ..., V$), where $V$ is the vocabulary size. The final prediction is selected as the element with the maximum score over the concatenation of $s^{voc}$ and $s^{ocr}$.

Figure 3: An overview of the transformer with simplified decoder (TSD). The TSD output is used to generate the bounding box, which is then used for answer prediction.

Note that the predicted bounding box is not explicitly used in generating the answer. However, localization prediction is a vision task, so it forces the network to exploit visual features. As a result, it serves as a good complement to the classical vocabulary classification task, which mainly focuses on linguistic semantics. The localization-aware predictor strengthens the learned answer embedding to attend to the correct OCR token, which in turn helps the classifier find the correct word. Moreover, this localization information improves the performance on position-related questions, as shown in Figures 4(a) and 4(c), which will be further discussed in Section 4.2.

Loss Design to Incorporate the Evidence Scores
We use the IoU scores as the evidence for the generated answer. Therefore, we propose a multitask loss, which encourages the answer embedding to learn both the semantics and the localization information provided by the OCR tokens. The proposed multitask loss consists of three individual loss functions: the localization loss $L_l$, the semantic loss $L_s$ and the fusion loss $L_f$.
The answer embedding output from the decoder is fed into a multilayer perceptron (MLP) to directly predict the bounding box location $b^{pred}$ of the answer OCR token. Inspired by (Carion et al., 2020), the localization loss $L_l$ is defined as
$$L_l = I \left[ \left(1 - \mathrm{IoU}(b^{pred}, b^{gt})\right) + \left\| b^{pred} - b^{gt} \right\|_1 \right],$$
where $b^{gt}$ denotes the ground-truth bounding box, which is obtained by matching the OCR token text to the ground-truth answer text. IoU and $\|\cdot\|_1$ calculate the intersection over union and the L1 norm respectively between the predicted and ground-truth bounding boxes. $I = 1$ if the answer word matches one of the recognized OCR texts and $0$ otherwise. To accurately answer a question, OCR localization and semantic information are both critical. Thus, we propose a fusion loss $L_f$ to couple the localization prediction and the semantic representation of the answer; the two aspects of information complement each other in the process of decision making. Formally, given the target scores $t^{ocr}_m \in \{0, 1\}$ ($m = 1, ..., M$), we formulate our fusion loss $L_f$ as the binary cross entropy
$$L_f = -\frac{1}{M} \sum_{m=1}^{M} \left[ t^{ocr}_m \log \sigma(s^{ocr}_m) + \left(1 - t^{ocr}_m\right) \log \left(1 - \sigma(s^{ocr}_m)\right) \right].$$
In order to exploit the linguistic meaning of the answer embedding, we collect a fixed vocabulary of frequently used words. We feed the localization-aware answer representation $z^{ans}$ into a linear classifier to classify the answer embedding of each decoding step into one of the vocabulary words. Our semantic loss $L_s$ is computed as the cross entropy between the classification score vector and the one-hot encoding of the ground-truth word. The overall multitask loss of the network is calculated as $L = L_f + \lambda_l L_l + \lambda_s L_s$, where $\lambda_l$ and $\lambda_s$ are regulation coefficients that determine the importance of the localization loss and semantic loss. The values of $\lambda_l$ and $\lambda_s$ are experimentally selected.
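The three loss terms can be combined as follows. This is a NumPy sketch under stated assumptions: the exact form of the IoU-based localization term and the per-token binary cross entropy for the fusion loss follow our reading of the description, and the function names are our own.

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def laap_loss(b_pred, b_gt, matched, s_ocr, t_ocr, s_voc, t_voc,
              lam_l=1.0, lam_s=1.0):
    """Multitask loss L = L_f + lam_l * L_l + lam_s * L_s (schematic)."""
    # Localization loss: applied only when the answer matches an OCR token.
    l_loc = matched * ((1.0 - iou(b_pred, b_gt)) + np.abs(b_pred - b_gt).sum())
    # Fusion loss: binary cross entropy over the OCR target scores in {0, 1}.
    p = 1.0 / (1.0 + np.exp(-s_ocr))
    l_fus = -(t_ocr * np.log(p) + (1 - t_ocr) * np.log(1 - p)).mean()
    # Semantic loss: cross entropy against the ground-truth vocabulary word.
    logp = s_voc - np.log(np.exp(s_voc).sum())
    l_sem = -logp[t_voc]
    return l_fus + lam_l * l_loc + lam_s * l_sem
```

The indicator `matched` plays the role of $I$: when the answer is not present among the OCR tokens, the localization term contributes nothing and only the fusion and semantic terms drive training.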

Transformer with Simplified Decoder
Existing works (Hu et al., 2020; Gao et al., 2020) use a BERT-like transformer architecture, which allows each decoder layer to attend to the encoder layer of the same depth. However, a deeper encoder layer extracts a broader view of the input than a shallow layer (Clark et al., 2019). As such, we adopt the standard transformer encoder-decoder structure as shown in Figure 3. Here, we use the transformer with simplified decoder (TSD), removing the decoder self-attention to reduce the computational cost. We experimentally find that using only the encoder-decoder attention maintains the same performance. The multimodal inputs are encoded by L stacked standard transformer encoder layers. The embedding of the last encoder layer is fed into each of the L decoder layers. The answer words are generated in an auto-regressive manner, i.e. at each decoding step, we take the predicted answer embedding from the previous step as the decoder input and obtain the answer embedding as the decoder output. The decoding process is performed by the proposed localization-aware prediction module as shown in Figure 1 and discussed in Section 3.3.
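One TSD decoder layer can be sketched as follows: only encoder-decoder attention plus a feed-forward block, with the usual decoder self-attention removed. This is a single-head NumPy sketch that omits LayerNorm and multi-head splitting for brevity; names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tsd_decoder_layer(y, enc, Wq, Wk, Wv, W1, W2):
    """Simplified decoder layer: encoder-decoder attention + feed-forward.

    y:   (T, d) decoder inputs for the T steps decoded so far
    enc: (S, d) last-encoder-layer embeddings (question + enriched OCR)
    """
    d = y.shape[1]
    # Attend to the encoder output only; no decoder self-attention.
    att = softmax((y @ Wq) @ (enc @ Wk).T / np.sqrt(d))
    y = y + att @ (enc @ Wv)                  # residual connection
    return y + np.maximum(0.0, y @ W1) @ W2   # ReLU feed-forward, residual
```

Dropping the self-attention removes one attention computation per layer per step, which is where the claimed computational saving comes from.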

Experiments
We evaluate our LaAP-Net on three challenging benchmark datasets: TextVQA, ST-VQA (Biten et al., 2019a) and OCR-VQA (Mishra et al., 2019). We show that the proposed LaAP-Net outperforms state-of-the-art works on these datasets. We further perform an ablation study on the TextVQA dataset to investigate the proposed context-enriched OCR representation (COR) and the localization-aware answer prediction (LaAP).

Implementation Details
For a fair comparison with the state-of-the-art methods, we follow the same multimodal input as M4C (Hu et al., 2020). More specifically, we use a pretrained BERT (Devlin et al., 2019) model for question encoding, the Rosetta-en OCR system (Borisyuk et al., 2018) for OCR recognition and a Faster R-CNN (Ren et al., 2015) based image feature extractor. The OCR tokens are represented by a concatenation of appearance features from Faster R-CNN, FastText embeddings (Bojanowski et al., 2017), PHOC features (Almazán et al., 2014) and a bounding box (bbox) embedding. We set the common dimensionality d = 768 and the number of transformer layers L = 4. More details of the training configuration are summarized in the supplementary material.
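The OCR token representation described above is a simple concatenate-and-project step, sketched below. The feature dimensions in the comment are common choices for these feature types and are assumptions, not figures taken from the paper.

```python
import numpy as np

def ocr_token_features(frcnn, fasttext, phoc, bbox, W):
    """Concatenate the four OCR feature sources and project to d dims.

    frcnn:    (M, a) Faster R-CNN appearance features (often a = 2048)
    fasttext: (M, b) FastText word embeddings         (often b = 300)
    phoc:     (M, c) PHOC character features          (often c = 604)
    bbox:     (M, 4) bounding box coordinates
    W:        (a+b+c+4, d) linear projection to the common space
    """
    feats = np.concatenate([frcnn, fasttext, phoc, bbox], axis=-1)
    return feats @ W  # per-token feature in the common d = 768 space
```

In practice each source would be normalized and projected separately before summation or concatenation; the single projection here is the minimal version of the idea.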

Evaluation on TextVQA Dataset
The TextVQA dataset contains 28408 images with 34602 training, 5000 validation and 5734 testing question-answer pairs. We compare our results on TextVQA with the latest SOTA method SMA (Gao et al., 2020) and other existing works such as LoRRA, MSFT VTI (MSFT-VTI, 2019) and M4C (Hu et al., 2020). The proposed LaAP-Net achieves 40.68% validation accuracy and 40.54% test accuracy, which improves over the SOTA by 1.10% (absolute) and 0.25% (absolute) respectively.

(Table: validation and test accuracy of the compared methods on TextVQA.)

Note that we only compare with the SMA results using the same set of features, to show the advantage of the network structure itself. We also train our network with additional data from the ST-VQA dataset, following M4C, which boosts the test accuracy by 0.95% (absolute).

Ablation Study on Network Components. The context-enriched OCR representation (COR) and the localization-aware predictor (LaP) are the two key features of our network. We investigate the importance of both components by progressively adding them to our transformer-with-simplified-decoder (TSD) backbone. First, we remove COR and LaP from our network and feed the image object features directly into the encoder, as in M4C; the answer prediction part also strictly follows M4C. This configuration is denoted as TSD in Table 1. Then we add COR on top of TSD, which is denoted as TSD+COR. The third ablation adds only LaP to TSD (TSD+LaP). Each component demonstrates a contribution to the performance improvement, as shown in Table 1. To further prove the effectiveness of COR and LaP, we add them to our baseline network M4C. COR and LaP individually lead to accuracy improvements of 0.38 and 0.95 respectively, and together they boost the accuracy by 1.33. Note that our network without COR, i.e. TSD+LaP, suffers a performance drop. The rationale is that the flat multimodal features used in place of COR contain both objects and OCR tokens, and the objects' position embeddings introduce much noise for the localization task. COR absorbs context object features into the OCR representation and improves its discriminating power. Meanwhile, the encoder multimodal input is simplified.
We restrict the answer generation source to study the effect of our method on word semantic learning and OCR selection. As shown in Table 2, our model significantly improves the accuracy when the answer is predicted from the vocabulary only. This implies that our localization prediction module enhances the network's capacity for learning the semantics of OCR tokens, which coincides with our qualitative analysis.
Evidence-based Qualitative Analysis on TextVQA Dataset. One challenge for existing VQA systems is that it is hard to tell whether a correct answer is generated through underlying reasoning or by exploiting the statistics of the dataset. As such, the Intersection over Union (IoU) (Wang et al., 2020) is recommended to measure the evidence for the generated answer. The IoU results for our bounding boxes are shown in Figure 4. For example, in Figure 4(b), the two IoU results (0.84, 0.68) explain the reason for the answer "startling stories"; a higher IoU indicates better evidence. These IoU scores also show that the answer is generated by exploiting the image features rather than the statistics of the dataset, i.e. a coincidental correlation in the data. Furthermore, we observe that most text VQA errors come from inaccurate OCR results, e.g. in Figure 4(d), the OCR token "intel)" is recognized wrongly, which leads to the false answer of M4C. Thanks to the localization prediction, our method generates the correct answer even in such cases (Figure 4(d)). Since localization tends to use the visual features of OCR tokens rather than their text embeddings, it can better determine the attended OCR token despite the text recognition result. With the predicted OCR bounding box, the answer generation problem is converted into a conditioned classification process $P(\text{text} \mid \text{predicted box})$ that recognizes the text from the vocabulary. More examples supporting our analysis can be found in Figure 4.
Our localization predictor also shows the capability of understanding position and direction, as shown in Figure 4(a, c). Our network learns to understand position during training because the ground-truth position is provided directly to guide the localization prediction, whereas in previous works, positional information is passed through several layers of encoder and decoder without explicit guidance.

Evaluation on ST-VQA Dataset.
We evaluate the proposed model on the open-vocabulary task of ST-VQA (Biten et al., 2019b), which contains 18921 training-validation images and 2971 test images. Following previous works (Hu et al., 2020; Gao et al., 2020), we split the images into a training set and a validation set with sizes of 17028 and 1893 respectively.
We report both accuracy and the ANLS score (the default metric of ST-VQA) in Table 4. Our LaAP-Net surpasses the SOTA method by a large margin on both metrics. Note that SMA improves over its baseline method M4C by only 0.004 in testing ANLS score, while we boost the result by 0.019. Evidence-based Qualitative Analysis on ST-VQA. Figure 5 shows the IoU scores, our predicted bounding box and the answer. In these examples, our proposed localization-aware answer predictor not only generates the correct answer, but also predicts the exact bounding box (drawn in blue) of the corresponding OCR token. Similar conclusions can be drawn from these results as discussed for the TextVQA dataset. In Figure 5(a), our network correctly attends to the middle sign designated by the question, where our reference method M4C fails. In Figure 5(c), our network manages to predict the word 'river' even though it is not recognized by the OCR system. More qualitative examples can be found in the supplementary material.

Evaluation on OCR-VQA Dataset
Unlike TextVQA and ST-VQA, which contain "in the wild" images, the OCR-VQA dataset consists of 207572 images of book covers only. Thus, the image object modality is less important in OCR-VQA. Moreover, since the questions are about the title or author of a book, it is relatively difficult to determine the location. Even so, our model still achieves a state-of-the-art result of 64.1% accuracy, as shown in Table 5.

Failure Analysis
Two failure cases are shown in Figure 6. As discussed in Section 4.2, our model is sensitive to positional instructions in a question. However, in Figure 6(a), the question asks about a relative position, which our network has not learned to handle. In Figure 6(b), the position "right" is indicated by an arrow, but our network locates the road sign on the right of the image. In this case, question answering requires reasoning in addition to text reading, which we will investigate in our future work.

Conclusion
This paper proposes a localization-aware answer prediction network (LaAP-Net) for text VQA. Our LaAP-Net not only generates the answer to the question, but also provides a bounding box as evidence of the generated answer. Moreover, a context-enriched OCR representation (COR) is proposed to integrate object-related features. The proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin, setting new state-of-the-art performance: TextVQA 41.41%, ST-VQA 0.485 (ANLS) and OCR-VQA 64.1%.

Figure 1: Qualitative examples from the TextVQA dataset. We display the predicted answers of our LaAP-Net (yellow for words generated from OCR and blue for words from the vocabulary) and the ground truth (GT). Our predicted bounding box (blue box) is also depicted in the images to compare with the GT box (red box). Note that some images do not contain a GT bounding box, while some images contain more than one GT bounding box.

Figure 2: Qualitative examples from the ST-VQA dataset. We display the predicted answers of our LaAP-Net (yellow for words generated from OCR and blue for words from the vocabulary) and the ground truth (GT). Our predicted bounding box (blue box) is also depicted in the images to compare with the GT box (red box).