Open Domain Question Answering over Tables via Dense Retrieval

Recent advances in open-domain QA have led to strong models based on dense retrieval, but these have focused only on retrieving textual passages. In this work, we tackle open-domain QA over tables for the first time, and show that retrieval can be improved by a retriever designed to handle tabular context. We present an effective pre-training procedure for our retriever and improve retrieval quality with mined hard negatives. As relevant datasets are missing, we extract a subset of Natural Questions (Kwiatkowski et al., 2019) into a Table QA dataset. We find that our retriever improves retrieval results from 72.0 to 81.1 recall@10 and end-to-end QA results from 33.8 to 37.7 exact match, over a BERT-based retriever.


Introduction
Models for question answering (QA) over tables usually assume that the relevant table is given at test time. This applies to semantic parsing (e.g., for models trained on SPIDER (Yu et al., 2018)) and to end-to-end QA (Neelakantan et al., 2016; Herzig et al., 2020). While this assumption simplifies the QA model, it is not realistic for many use cases where the question is asked through some open-domain natural language interface, such as web search or a virtual assistant.
In these open-domain settings, the user has some information need, and the corresponding answer resides in some table in a large corpus of tables. The QA model then needs to utilize the corpus as an information source, efficiently search for the relevant table within, parse it, and extract the answer.
Recently, much work has explored open-domain QA over a corpus of textual passages (Chen et al., 2017; Sun et al., 2018; Yang et al., 2019; Lee et al., 2019, inter alia). These approaches usually follow a two-stage framework: (1) a retriever first selects a small subset of candidate passages relevant to the question, and then (2) a machine reader examines the retrieved passages and selects the correct answer. While these approaches work well on free text, it is not clear whether they can be directly applied to tables, as tables are semi-structured and thus different from free text. In this paper we describe the first study to tackle open-domain QA over tables, and focus on modifying the retriever. We follow the two-step approach of a retriever model that retrieves a small set of candidate tables from a corpus, followed by a QA model (Figure 1). Specifically, we utilize dense retrieval approaches targeted at retrieving passages (Guu et al., 2020, inter alia), and modify the retriever to better handle tabular contexts. We present a simple and effective pre-training procedure for our retriever, and further improve its performance by mining hard negatives using the retriever model. Finally, as relevant open-domain datasets are missing, we process NATURAL QUESTIONS (Kwiatkowski et al., 2019) into a Table QA dataset.

* Work completed while interning at Google.

Setup
We formally define open-domain extractive QA over tables as follows. We are given a training set of question-table-answer triples, where each table T_i belongs to a corpus of tables C. The answer a_i is comprised of one or more spans of tokens in T_i. Our goal is to learn a model that, given a new question q and the corpus C, returns the correct answer a.
Our task shares similarities with open-domain QA over documents (Chen et al., 2017; Yang et al., 2019, inter alia), where the corpus C consists of textual passages extracted from documents instead of tables, and the answer is a span that appears in some passage in the corpus. As in these works, dealing with a large corpus (of tables, in our setting) requires retrieval of relevant context. Naively applying a QA model, for example TAPAS (Herzig et al., 2020), over each table in the large corpus is not practical because inference is too expensive.
To this end we break our system into two independent steps. First, an efficient table retriever component selects a small set of candidate tables C R from a large corpus of tables C. Second, we apply a QA model to extract the answer a given the question q and the candidate tables C R .

Dense Table Retrieval
In this section we describe our dense table retriever (DTR), which retrieves a small set of K candidate tables C R given a question q and a corpus C. In this work we set K = 10 and take C to be the set of all tables in the dataset we experiment with (see §6).
As in recent work for open-domain QA on passages (Guu et al., 2020; Chen et al., 2021, inter alia), we follow a dense retrieval architecture. As tables that contain the answer to q do not necessarily include tokens from q, a dense encoding can better capture similarities between table contents and a question.
For training DTR, we leverage both in-domain training data D train , and automatically constructed pre-training data D pt of text-table pairs (see below).

Retrieval Model
In this work we focus on learning a retriever that can represent tables in a meaningful way, by capturing their specific structure. Traditional information retrieval methods such as BM25 are designed to capture token overlap between a query and a textual document, and other dense encoders are pre-trained language models (such as BERT) targeted at text representations.
Recently, Herzig et al. (2020) proposed TAPAS, an encoder based on BERT, designed to contextually represent text and a table jointly. TAPAS includes table-specific embeddings that capture its structure, such as row and column ids. In DTR, we use TAPAS to represent both the query q and the table T. For efficient retrieval during inference we use two different TAPAS instances (for q and for T), and learn a similarity metric between them, as in prior dense retrieval work.
More concretely, the TAPAS encoder TAPAS(x_1, [x_2]) takes one or two inputs as arguments, where x_1 is a string and x_2 is a flattened table. We then define the retrieval score as the inner product of dense vector representations of the question q and the table T:

S_ret(q, T) = h_q · h_T
h_q = W_q · TAPAS_q(q)[CLS]
h_T = W_T · TAPAS_T(title(T), T)[CLS]

where TAPAS(·)[CLS] returns the hidden state for the CLS token, W_q and W_T are matrices that project the TAPAS output into d = 256 dimensional vectors, and title(T) is the page title for table T. We found the table's page title to assist in retrieving relevant tables, which is also useful for Wikipedia passage retrieval.
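The scoring function can be sketched in plain Python. The [CLS] vectors and projection matrices below are random stand-ins (a real system would obtain them from the two TAPAS encoder instances), so this is a minimal sketch of the inner-product scoring, not the actual encoder.

```python
import random

D_MODEL, D_PROJ = 8, 4  # toy sizes; the paper projects to d = 256

def matvec(W, x):
    """Multiply a matrix (nested lists, rows x cols) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def retrieval_score(cls_q, cls_t, W_q, W_T):
    """S_ret(q, T): inner product of the projected [CLS] representations."""
    h_q = matvec(W_q, cls_q)
    h_t = matvec(W_T, cls_t)
    return sum(a * b for a, b in zip(h_q, h_t))

# Stand-in encoder outputs and projections (random, for illustration only).
random.seed(0)
cls_q = [random.gauss(0, 1) for _ in range(D_MODEL)]
cls_t = [random.gauss(0, 1) for _ in range(D_MODEL)]
W_q = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_PROJ)]
W_T = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_PROJ)]
print(retrieval_score(cls_q, cls_t, W_q, W_T))
```

Because the score is a plain inner product, table representations can be computed once offline and compared to any question vector at query time.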
Training The goal of the retriever is to learn an embedding space in which relevant question-table pairs are closer (i.e., have a larger dot product) than irrelevant pairs. To increase the likelihood of gold (q, T) pairs, we train the retriever with in-batch negatives (Gillick et al., 2019; Henderson et al., 2017). Let {(q_i, T_i)} for i = 1..B be a batch, where for each q_i, T_i is the gold table to retrieve, and for each j ≠ i we treat T_j as a negative. We then define the likelihood of the gold table T_i as:

P(T_i | q_i) = exp(S_ret(q_i, T_i)) / Σ_{j=1..B} exp(S_ret(q_i, T_j)).
To train the model efficiently, we define Q and T to be B × d matrices that hold the representations of the questions and tables, respectively. Then S = QTᵀ gives a B × B matrix where the logits for the gold tables are on the diagonal. We then train using a row-wise cross-entropy loss where the label matrix is the B × B identity.
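A minimal sketch of this row-wise cross-entropy with identity labels, assuming the score matrix S = QTᵀ has already been computed (the values below are toy numbers, not model outputs):

```python
import math

def softmax_ce_inbatch(S):
    """Row-wise cross-entropy over a B x B score matrix S = Q T^T.
    The gold table for row i sits on the diagonal (identity labels)."""
    losses = []
    for i, row in enumerate(S):
        m = max(row)  # subtract the max to stabilize the softmax
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        losses.append(log_z - row[i])  # -log P(T_i | q_i)
    return sum(losses) / len(losses)

# Toy 3x3 score matrix: the gold (diagonal) scores dominate, so loss is small.
S = [[5.0, 0.0, 0.0],
     [0.0, 5.0, 0.0],
     [0.0, 0.0, 5.0]]
print(softmax_ce_inbatch(S))
```

When all scores in a row are equal, the loss per row is log B, the entropy of a uniform guess over the batch.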
Pre-training One could train our retriever from scratch, relying solely on a sufficiently large in-domain training dataset D_train. However, we find performance improves after using a simple pre-training method for our retriever. Lee et al. (2019) suggest pre-training a textual dense retriever using an Inverse Cloze Task (ICT). In ICT, the goal is to predict a context given a sentence s. The context is the passage that originally contains s, with s masked out. The motivation is that the relevant context should be semantically similar to s, and should contain information missing from s.
Similarly, we posit that a table T that appears in close proximity to some text span s is more relevant to s than a random table. To construct D_pt, we use the pre-training data from Herzig et al. (2020). They extracted text-table pairs from 6.2M Wikipedia tables, where text spans were sampled from the table caption, page title, page description, segment title, and text of the segment in which the table occurs. This resulted in a total of 21.3M text-table (s, T) pairs. While Herzig et al. (2020) use the extracted (s, T) pairs to pre-train TAPAS with a masked language modeling objective, we pre-train DTR on these pairs with the same objective used for in-domain data.
Hard Negatives Following similar work (Gillick et al., 2019; Xiong et al., 2021), we use an initial retrieval model to extract the most similar tables from C for each question in the training set. From this list we discard every table that contains the reference answer, to remove false negatives. We use the highest-scoring remaining table as the hard negative for that question.
Given the new triplets of question, reference table, and mined negative table, we train a new model using a modified version of the in-batch negative training discussed above. Given Q and S as defined above, and a new B × d matrix N that holds the representations of the negative tables, S' = QNᵀ gives another B × B matrix whose values we want to be small (possibly negative). Concatenating the columns of S and S' yields a B × 2B logit matrix on which we can perform the same cross-entropy training as before; the label matrix is now the B × B identity concatenated with a B × B zero matrix.
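The logit construction can be sketched as follows; the score values are toy numbers, and the helper name is illustrative, not from the paper's code:

```python
def hard_negative_logits(S, S_neg):
    """Concatenate the columns of the in-batch score matrix S (B x B) with
    the hard-negative scores S_neg (B x B, scoring each question against the
    batch's mined negative tables). The result is B x 2B; the gold label for
    row i is still column i, and columns B..2B-1 are always negatives."""
    return [row + row_neg for row, row_neg in zip(S, S_neg)]

S = [[4.0, 1.0], [0.5, 3.0]]       # q_i vs in-batch tables (gold on diagonal)
S_neg = [[2.0, 0.1], [0.2, 1.5]]   # q_i vs mined hard negatives
print(hard_negative_logits(S, S_neg))
```

The same row-wise cross-entropy applies unchanged: each row now simply has B extra negative logits competing with the gold score.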
Inference During inference time, we apply the table encoder TAPAS T to all the tables T ∈ C offline. Given a test question q, we derive its representation h q and retrieve the top K tables with representations closest to h q .
In our experiments, we use exhaustive search to find the top K tables, but to scale to large corpora, fast maximum inner product search using existing tools such as FAISS (Johnson et al., 2019) and SCANN (Guo et al., 2020) could be used, instead.
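The exhaustive search used in our experiments amounts to scoring every precomputed table vector against the question vector and keeping the K best. A minimal sketch with toy 2-dimensional vectors (real representations are d = 256):

```python
def top_k_tables(h_q, table_reps, k):
    """Exhaustive maximum inner product search: score every table
    representation against the question vector and return the indices
    of the k highest-scoring tables, best first."""
    scores = [sum(a * b for a, b in zip(h_q, h_t)) for h_t in table_reps]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy 2-d table representations; the question vector points along (1, 0).
table_reps = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0], [-0.5, 0.5]]
print(top_k_tables([1.0, 0.0], table_reps, k=2))
```

A library such as FAISS would replace this linear scan with an approximate index when the corpus grows large.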

Question Answering over Tables
A reader model is used to extract the answer a given the question q and the K candidate tables. The model scores each candidate and at the same time extracts a suitable answer span from the table. Each table and question are jointly encoded using a TAPAS model. The candidate score is a simple logistic loss based on the CLS token, as in prior work.
The answer span extraction is modeled as a softmax over all possible spans up to a certain length. Spans that are located outside of a table cell, or that cross a cell boundary, are masked. Following Lee et al. (2017), the span representation is the concatenation of the contextual representations of the first and last tokens in the span s.

The training and test data are created by running a retrieval model. We extract the K = 10 highest-scoring candidate tables for each question. At training time we add the reference table if it is missing from the candidates.
At inference time all table candidates are processed and the answer of the candidate with the highest score is returned as the predicted answer.
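The span-masking step can be sketched by enumerating candidate spans over a token-to-cell map; the map below is a hypothetical toy layout (question tokens mapped to None, then two two-token cells), not the actual TAPAS tokenization.

```python
def valid_spans(cell_ids, max_len):
    """Enumerate (start, end) token spans (inclusive) of at most max_len
    tokens, keeping only spans that lie entirely inside one table cell.
    cell_ids[t] is the id of the cell containing token t (None = outside
    the table, e.g. question tokens)."""
    spans = []
    n = len(cell_ids)
    for start in range(n):
        for end in range(start, min(start + max_len, n)):
            window = cell_ids[start:end + 1]
            if window[0] is not None and all(c == window[0] for c in window):
                spans.append((start, end))
    return spans

# Toy map: two question tokens, then cell 0 (tokens 2-3) and cell 1 (tokens 4-5).
cell_ids = [None, None, 0, 0, 1, 1]
print(valid_spans(cell_ids, max_len=2))
```

The softmax over spans is then taken only over this valid set; all other spans are masked out.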

Dataset
We create a new English dataset called NQ-TABLES from NATURAL QUESTIONS (Kwiatkowski et al., 2019). We randomly split the original NQ train set into train and dev sets (based on a hash of the page title) and use all questions from the original NQ dev set as our test set. To construct the corpus C, we extract all tables that appear in articles in all NQ sets.
NQ can contain the same Wikipedia page in different versions, which leads to many almost identical tables. We merge close duplicates using the following procedure. For all tables that occur on the same Wikipedia page, we flatten the entire table content, tokenize it, and compute L2-normalized unigram vectors of the token counts of each table. We then compute the pairwise cosine similarity of all tables, iterate over the table pairs in decreasing order of similarity, and attempt to merge them into clusters; this is essentially a version of single-link clustering. In particular, we merge two tables if their similarity is > 0.91, they do not occur on the same version of the page, their difference in number of rows is at most 2, and they have the same number of columns.
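The similarity computation and the merge criterion can be sketched directly; the helper names and the toy token lists are illustrative, not from the paper's code.

```python
import math
from collections import Counter

def unigram_cosine(tokens_a, tokens_b):
    """Cosine similarity of L2-normalized unigram count vectors."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def should_merge(tokens_a, tokens_b, rows_a, rows_b, cols_a, cols_b,
                 same_page_version):
    """Merge criterion from the text: similarity > 0.91, different page
    versions, row difference <= 2, and the same number of columns."""
    return (unigram_cosine(tokens_a, tokens_b) > 0.91
            and not same_page_version
            and abs(rows_a - rows_b) <= 2
            and cols_a == cols_b)

# Two near-duplicate flattened tables: b is a with one extra token.
a = "rank nation gold silver bronze total 1 norway 14 11".split()
b = "rank nation gold silver bronze total 1 norway 14 11 2".split()
print(unigram_cosine(a, b), should_merge(a, b, 5, 6, 2, 2, False))
```

Sorting pairs by this similarity and merging greedily under the criterion yields the single-link clusters described above.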
Dataset sizes are given in the accompanying table.

Table 1 shows the test results for table retrieval (dev results are in Appendix B). We report recall at K (R@K), the fraction of questions for which the K highest-scoring tables contain the reference table.

Retrieval Results
We find that all dense models that have been pre-trained outperform the BM25 baseline by a large margin. The model that uses the TAPAS table embeddings (DTR) outperforms the dense baselines by more than 1 point in R@10. The addition of mined negatives (DTR +hn) yields an additional improvement of more than 5 points. Mining negatives from DTR works better than mining negatives from BM25 (DTR +hnbm25, +0.6 R@10).
End-to-End QA Results for end-to-end QA experiments are shown in Table 2 (dev results are in Appendix B). We use the exact match (EM) and token F1 metrics as implemented for SQuAD (Rajpurkar et al., 2016). We additionally report oracle metrics, which are computed on the best answer returned for any of the candidates.
We again find that all dense models outperform the BM25 baseline. A TAPAS-based reader outperforms a BERT reader by more than 3 points in EM. The simple DTR model outperforms the baselines by more than 1 point in EM. Hard negatives from BM25 (+hnbm25) improve DTR's performance by 1 point, while hard negatives from DTR (+hn) improve performance by 2 points. We additionally perform a McNemar's significance test for our proposed model, DTR +hn, and find that it performs significantly better (p < 0.05) than all baselines.
Analysis Analyzing the best model in Table 2 (DTR +hn) on the dev set, we find that 29% of the questions are answered correctly, 14% require a list answer (which is out of scope for this paper), 12% do not have any

Conclusion
In this paper we demonstrated that a retriever designed to handle tabular context can outperform other textual retrievers for open-domain QA on tables. We additionally showed that our retriever can be effectively pre-trained and improved by hard negatives. In future work we aim to tackle multimodal open-domain QA, combining passages and tables as context.

A Experimental Setup
The DTR model uses a batch size of 256. We pre-trained the question and table encoders for 1M steps, and fine-tuned them for a maximum of 200,000 steps with a learning rate of 1.25e-5, using Adam with linear scheduling, warm-up, and a dropout rate of 0.2. The hyper-parameter values were selected based on the values used by Herzig et al. (2020) on the SQA dataset. We evaluate DTR performance using recall@K, and perform early stopping according to recall@10 on the dev set. For efficiency, we use only the tables that appear in the dev set as the corpus during early stopping.
For the QA reader, we initialize the model from the public TAPAS checkpoint. We use a batch size of 512, train for 50,000 steps with a learning rate of 1e-6, and a dropout rate of 0.2. In this setup we do not use early stopping but always train the model for the full number of steps. We limit the maximal answer length to 10 word pieces. The hyper-parameters of the QA model were optimized using a black-box Bayesian optimizer similar to Google Vizier (Golovin et al., 2017), with the bounds given in Table 3:

parameter       min    max
learning rate   1e-6   1e-2
warm-up ratio   0.0    0.2
dropout         0.0    0.2

We trained all models on 32 Cloud TPU v3. Pre-training a retrieval model takes approx. 6 days. Training a retrieval model takes approx. 4-5 hours. Training a QA model takes approx. 10 hours.
The number of parameters is the same as for a BERT large model: 340M.

B Results
Dev and test results for the retrieval experiments are given in Table 4. Dev and test results for end-to-end QA are given in Table 5.