TaPas: Weakly Supervised Table Parsing via Pre-training

Answering natural language questions over tables is usually seen as a semantic parsing task. To alleviate the collection cost of full logical forms, one popular approach focuses on weak supervision consisting of denotations instead of logical forms. However, training semantic parsers from weak supervision poses difficulties, and in addition, the generated logical forms are only used as an intermediate step prior to retrieving the denotation. In this paper, we present TaPas, an approach to question answering over tables without generating logical forms. TaPas trains from weak supervision, and predicts the denotation by selecting table cells and optionally applying a corresponding aggregation operator to such selection. TaPas extends BERT’s architecture to encode tables as input, initializes from an effective joint pre-training of text segments and tables crawled from Wikipedia, and is trained end-to-end. We experiment with three different semantic parsing datasets, and find that TaPas outperforms or rivals semantic parsing models by improving state-of-the-art accuracy on SQA from 55.1 to 67.2 and performing on par with the state-of-the-art on WikiSQL and WikiTQ, but with a simpler model architecture. We additionally find that transfer learning, which is trivial in our setting, from WikiSQL to WikiTQ, yields 48.7 accuracy, 4.2 points above the state-of-the-art.


Introduction
Question answering from semi-structured tables is usually seen as a semantic parsing task where the question is translated to a logical form that can be executed against the table to retrieve the correct denotation (Pasupat and Liang, 2015; Zhong et al., 2017; Agarwal et al., 2019). Semantic parsers rely on supervised training data that pairs natural language questions with logical forms, but such data is expensive to annotate.
In recent years, many attempts have aimed to reduce the burden of data collection for semantic parsing, including paraphrasing (Wang et al., 2015), human in the loop (Iyer et al., 2017; Lawrence and Riezler, 2018) and training on examples from other domains (Herzig and Berant, 2017; Su and Yan, 2017). One prominent data collection approach focuses on weak supervision where a training example consists of a question and its denotation instead of the full logical form (Clarke et al., 2010; Liang et al., 2011; Artzi and Zettlemoyer, 2013). Although appealing, training semantic parsers from this input is often difficult due to the abundance of spurious logical forms (Berant et al., 2013; Guu et al., 2017) and reward sparsity (Agarwal et al., 2019; Muhlgay et al., 2019). In addition, semantic parsing applications only utilize the generated logical form as an intermediate step in retrieving the answer. Generating logical forms, however, introduces difficulties such as maintaining a logical formalism with sufficient expressivity, obeying decoding constraints (e.g. well-formedness), and the label bias problem (Andor et al., 2016; Lafferty et al., 2001).
In this paper we present TAPAS (for Table Parser), a weakly supervised question answering model that reasons over tables without generating logical forms. TAPAS predicts a minimal program by selecting a subset of the table cells and a possible aggregation operation to be executed on top of them. Consequently, TAPAS can learn operations from natural language, without the need to specify them in some formalism. This is implemented by extending BERT's architecture (Devlin et al., 2019) with additional embeddings that capture tabular structure, and with two classification layers for selecting cells and predicting a corresponding aggregation operator.
Importantly, we introduce a pre-training method for TAPAS, crucial for its success on the end task. We extend BERT's masked language model objective to structured data, and pre-train the model over millions of tables and related text segments crawled from Wikipedia. During pre-training, the model masks some tokens from the text segment and from the table itself, where the objective is to predict the original masked token based on the textual and tabular context.
Finally, we present an end-to-end differentiable training recipe that allows TAPAS to train from weak supervision. For examples that only involve selecting a subset of the table cells, we directly train the model to select the gold subset. For examples that involve aggregation, the relevant cells and the aggregation operation are not known from the denotation. In this case, we calculate an expected soft scalar outcome over all aggregation operators given the current model, and train the model with a regression loss against the gold denotation.
In comparison to prior attempts to reason over tables without generating logical forms (Neelakantan et al., 2015; Yin et al., 2016; Müller et al., 2019), TAPAS achieves better accuracy, and holds several advantages: its architecture is simpler as it includes a single encoder with no auto-regressive decoding, it enjoys pre-training, tackles more question types such as those that involve aggregation, and directly handles a conversational setting.
We find that on three different semantic parsing datasets, TAPAS performs better or on par in comparison to other semantic parsing and question answering models. On the conversational SQA (Iyyer et al., 2017), TAPAS improves state-of-the-art accuracy from 55.1 to 67.2, and achieves on par performance on WIKISQL (Zhong et al., 2017) and WIKITQ (Pasupat and Liang, 2015). Transfer learning, which is simple in TAPAS, from WIKISQL to WIKITQ achieves 48.7 accuracy, 4.2 points higher than state-of-the-art. Our code and pre-trained model are publicly available at https://github.com/google-research/tapas.

TAPAS Model
Our model's architecture (Figure 1) is based on BERT's encoder with additional positional embeddings used to encode tabular structure (visualized in Figure 2). We flatten the table into a sequence of words, split words into word pieces (tokens) and concatenate the question tokens before the table tokens. We additionally add two classification layers for selecting table cells and aggregation operators that operate on the cells. We now describe these modifications and how inference is performed.
Additional embeddings We add a separator token between the question and the table, but unlike Hwang et al. (2019) not between cells or rows. Instead, the token embeddings are combined with table-aware positional embeddings before feeding them to the model. We use different kinds of positional embeddings:
• Position ID is the index of the token in the flattened sequence (same as in BERT).
• Segment ID takes two possible values: 0 for the question, and 1 for the table header and cells.
• Column / Row ID is the index of the column/row that this token appears in, or 0 if the token is a part of the question.
• Rank ID: if column values can be parsed as floats or dates, we sort them accordingly and assign an embedding based on their numeric rank (0 for not comparable, 1 for the smallest item, i + 1 for an item with rank i). This can assist the model when processing questions that involve superlatives, as word pieces may not represent numbers informatively (Wallace et al., 2019).
• Previous Answer: given a conversational setup where the current question might refer to the previous question or its answers (e.g., question 5 in Figure 3), we add a special embedding that marks whether a cell token was the answer to the previous question (1 if the token's cell was an answer, or 0 otherwise).
Figure 2: Encoding of the question "query?" and a simple table using the special embeddings of TAPAS. The previous answer embeddings are omitted for brevity.
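To make the embedding scheme concrete, the following Python sketch computes the index features for a toy table. It is illustrative only: the names `encode` and `column_ranks`, whitespace tokenization in place of word pieces, dense ranking of ties, assigning header tokens to row 0, and the omission of date parsing and of the previous-answer feature are all simplifying assumptions, not details taken from the released implementation.

```python
from typing import Dict, List, Optional


def to_float(text: str) -> Optional[float]:
    """Parse a cell as a float; return None when it is not numeric."""
    try:
        return float(text)
    except ValueError:
        return None


def column_ranks(values: List[str]) -> List[int]:
    """Rank IDs for one column: 0 if not comparable, 1 for the smallest item."""
    numbers = [to_float(v) for v in values]
    if any(n is None for n in numbers):
        return [0] * len(values)
    ordered = sorted(set(numbers))  # assumption: ties share a rank
    return [ordered.index(n) + 1 for n in numbers]


def encode(question: str, header: List[str], rows: List[List[str]]) -> List[Dict]:
    """Flatten question + table and attach the index features to each token."""
    tokens: List[Dict] = []

    def add(token: str, segment: int, column: int, row: int, rank: int) -> None:
        tokens.append({"token": token, "position": len(tokens),
                       "segment": segment, "column": column,
                       "row": row, "rank": rank})

    # Question tokens: segment 0, column/row/rank 0.
    add("[CLS]", 0, 0, 0, 0)
    for word in question.split():
        add(word, 0, 0, 0, 0)
    add("[SEP]", 0, 0, 0, 0)  # single separator between question and table

    ranks = [column_ranks([row[c] for row in rows]) for c in range(len(header))]
    # Header tokens: segment 1, 1-based column, row 0 (assumption).
    for c, name in enumerate(header):
        for word in name.split():
            add(word, 1, c + 1, 0, 0)
    # Cell tokens: segment 1, 1-based column and row, rank of the cell value.
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            for word in cell.split():
                add(word, 1, c + 1, r + 1, ranks[c][r])
    return tokens
```

Each feature would index its own embedding table, and the resulting vectors would be summed with the token embedding before being fed to the encoder.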

Cell selection
This classification layer selects a subset of the table cells. Depending on the selected aggregation operator, these cells can be the final answer or the input used to compute the final answer. Cells are modelled as independent Bernoulli variables. First, we compute the logit for a token using a linear layer on top of its last hidden vector. Cell logits are then computed as the average over logits of tokens in that cell. The output of the layer is the probability $p_s^{(c)}$ to select cell $c$. We additionally found it useful to add an inductive bias to select cells within a single column. We achieve this by introducing a categorical variable to select the correct column. The model computes the logit for a given column by applying a new linear layer to the average embedding for cells appearing in that column. We add an additional column logit that corresponds to selecting no column or cells. We treat this as an extra column with no cells. The output of the layer is the probability $p_{col}^{(co)}$ to select column $co$, computed using softmax over the column logits. We set cell probabilities $p_s^{(c)}$ outside the selected column to 0.
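The cell selection layer can be sketched in a few lines of plain Python. The sketch assumes the token and column logits have already been produced by the linear layers; the function names, the `(row, column)` cell keys with 1-based columns, placing the empty column at index 0, and taking the selected column to be the arg max of the column distribution are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column), columns are 1-based


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def cell_probabilities(token_logits: Dict[Cell, List[float]]) -> Dict[Cell, float]:
    """Cell logit = mean of its token logits; each cell is a Bernoulli variable."""
    return {cell: sigmoid(sum(logits) / len(logits))
            for cell, logits in token_logits.items()}


def column_probabilities(column_logits: List[float], empty_logit: float) -> List[float]:
    """Softmax over columns; index 0 is the extra column with no cells."""
    return softmax([empty_logit] + column_logits)


def restrict_to_column(cell_probs: Dict[Cell, float],
                       column_probs: List[float]) -> Dict[Cell, float]:
    """Zero out the probability of every cell outside the selected column."""
    selected = max(range(len(column_probs)), key=column_probs.__getitem__)
    return {cell: (p if cell[1] == selected else 0.0)
            for cell, p in cell_probs.items()}
```

When the empty column receives the highest probability, no cell has a matching column index, so every cell probability becomes 0 and nothing is selected.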
Aggregation operator prediction Semantic parsing tasks require discrete reasoning over the table, such as summing numbers or counting cells. To handle these cases without producing logical forms, TAPAS outputs a subset of the table cells together with an optional aggregation operator. The aggregation operator describes an operation to be applied to the selected cells, such as SUM, COUNT, AVERAGE or NONE. The operator is selected by a linear layer followed by a softmax on top of the final hidden vector of the first token (the special [CLS] token). We denote this layer as $p_a(op)$, where $op$ is some aggregation operator.
Inference We predict the most likely aggregation operator together with a subset of the cells (using the cell selection layer). To predict a discrete cell selection we select all table cells for which their probability is larger than 0.5. These predictions are then executed against the table to retrieve the answer, by applying the predicted aggregation over the selected cells.
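A minimal sketch of this inference step is given below. The function name `predict`, the dictionary-based inputs, and the behaviour for an AVERAGE over an empty selection (returning `None`) are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple, Union

Cell = Tuple[int, int]  # (row, column)


def predict(cell_probs: Dict[Cell, float],
            values: Dict[Cell, str],
            operator_probs: Dict[str, float]) -> Union[List[str], Optional[float]]:
    """Pick the most likely operator, threshold the cells, execute on the table."""
    operator = max(operator_probs, key=operator_probs.get)
    # Discrete cell selection: keep cells whose probability exceeds 0.5.
    selected = sorted(cell for cell, p in cell_probs.items() if p > 0.5)
    if operator == "NONE":
        return [values[cell] for cell in selected]
    if operator == "COUNT":
        return float(len(selected))
    numbers = [float(values[cell]) for cell in selected]
    if operator == "SUM":
        return float(sum(numbers))
    if operator == "AVERAGE":
        return sum(numbers) / len(numbers) if numbers else None
    raise ValueError(f"unknown operator: {operator}")
```

With operator NONE the selected cell values themselves form the answer; otherwise the answer is the scalar obtained by applying the operator to them.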

Pre-training
Following the recent success of pre-training models on textual data for natural language understanding tasks, we wish to extend this procedure to structured data, as an initialization for our table parsing task. To this end, we pre-train TAPAS on a large number of tables from Wikipedia. This allows the model to learn many interesting correlations between text and the table, and between the cells of a column and their header.
We create pre-training inputs by extracting text-table pairs from Wikipedia. We extract 6.2M tables: 3.3M of class Infobox and 2.9M of class WikiTable. We consider tables with at most 500 cells. All of the end task datasets we experiment with only contain horizontal tables with a header row with column names. Therefore, we only extract Wiki tables of this form using the <th> tag to identify headers. Furthermore, we transpose Infoboxes into a table with a single header and a single data row. The tables created from Infoboxes are arguably not very typical, but we found them to improve performance on the end tasks.
As a proxy for questions that appear in the end tasks, we extract the table caption, article title, article description, segment title and text of the segment the table occurs in as relevant text snippets. In this way we extract 21.3M snippets.
We convert the extracted text-table pairs to pre-training examples as follows: Following Devlin et al. (2019), we use a masked language model pre-training objective. We also experimented with adding a second objective of predicting whether the table belongs to the text or is a random table but did not find this to improve the performance on the end tasks. This is aligned with Liu et al. (2019), who similarly did not benefit from a next sentence prediction task.
For pre-training to be efficient, we restrict our word piece sequence length to a certain budget (e.g., we use 128 in our final experiments). That is, the combined length of tokenized text and table cells has to fit into this budget. To achieve this, we randomly select a snippet of 8 to 16 word pieces from the associated text. To fit the table, we start by only adding the first word of each column name and cell. We then keep adding words turn-wise until we reach the word piece budget. For every table we generate 10 different snippets in this way.
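The turn-wise filling of the table can be sketched as follows. This is only one plausible reading of the procedure: the function name `fit_table`, counting one word as one word piece, and visiting cells in their flattened order within each round are assumptions.

```python
from typing import List


def fit_table(cells: List[List[str]], budget: int) -> List[List[str]]:
    """Truncate cells so that the kept words fit into `budget`.

    `cells` is the flattened table (column names first, then body cells),
    each given as a list of words. Every round adds at most one more word
    to each cell, so the first round keeps the first word of each cell.
    """
    kept = [0] * len(cells)  # number of words kept per cell
    used = 0
    progressed = True
    while progressed:
        progressed = False
        for i, words in enumerate(cells):
            if kept[i] < len(words) and used < budget:
                kept[i] += 1
                used += 1
                progressed = True
    return [words[:k] for words, k in zip(cells, kept)]
```

For example, `fit_table([["a", "b", "c"], ["d"], ["e", "f"]], 4)` keeps the first word of every cell in the first round and one extra word of the first cell in the second round.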
We follow the masking procedure introduced by BERT. We use whole word masking for the text, and we find it beneficial to apply whole cell masking (masking all the word pieces of the cell if any of its pieces is masked) to the table as well.
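Whole cell masking amounts to expanding an initial set of masked positions to cover every word piece of any cell it touches. The sketch below assumes a hypothetical `cell_ids` list that maps each token position to a cell identifier (or `None` for text tokens); how the initial positions are sampled is left out.

```python
from typing import List, Optional, Set


def whole_cell_mask(cell_ids: List[Optional[int]], seeds: Set[int]) -> Set[int]:
    """Expand masked positions so that a touched cell is masked entirely."""
    # Cells that contain at least one initially masked word piece.
    hit = {cell_ids[i] for i in seeds if cell_ids[i] is not None}
    # Keep the original positions and add every position of a touched cell.
    return set(seeds) | {i for i, cell in enumerate(cell_ids) if cell in hit}
```

Text positions in `seeds` are kept as they are; whole word masking for the text would be applied separately.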
We note that we additionally experimented with data augmentation, which shares a similar goal to pre-training. We generated synthetic pairs of questions and denotations over real tables via a grammar, and added these to the training data of the end tasks. As this did not improve end task performance significantly, we omit these results.

Fine-tuning
Overview We formally define table parsing in a weakly supervised setup as follows. Given a training set of $N$ examples $\{(x_i, T_i, y_i)\}_{i=1}^{N}$, where $x_i$ is an utterance, $T_i$ is a table and $y_i$ is a corresponding set of denotations, our goal is to learn a model that maps a new utterance $x$ to a program $z$, such that when $z$ is executed against the corresponding table $T$, it yields the correct denotation $y$. The program $z$ comprises a subset of the table cells and an optional aggregation operator. The table $T$ maps a table cell to its value.
As a pre-processing step described in Section 5.1, we translate the set of denotations $y$ for each example to a tuple $(C, s)$ of cell coordinates $C$ and a scalar $s$, which is only populated when $y$ is a single scalar. We then guide training according to the content of $(C, s)$. For cell selection examples, where $s$ is not populated, we train the model to select the cells in $C$. For scalar answer examples, where $s$ is populated but $C$ is empty, we train the model to predict an aggregation over the table cells that amounts to $s$. We now describe each of these cases in detail.
Cell selection In this case $y$ is mapped to a subset of the table cell coordinates $C$ (e.g., question 1 in Figure 3). For examples of this type, we use a hierarchical model that first selects a single column and then cells from within that column only.
We directly train the model to select the column col which has the highest number of cells in $C$. For our datasets, the cells in $C$ are contained in a single column, and so this restriction on the model provides a useful inductive bias. If $C$ is empty we select the additional empty column corresponding to empty cell selection. The model is then trained to select cells $C \cap col$ and not select $(T \setminus C) \cap col$. The loss is composed of three components: (1) the average binary cross-entropy loss over column selections:
$$J_{\text{columns}} = \frac{1}{|\text{Cols}|} \sum_{co \in \text{Cols}} \text{CE}\left(p_{col}^{(co)}, \mathbb{1}_{co = col}\right)$$
where the set of columns Cols includes the additional empty column, CE(·) is the cross entropy loss, and $\mathbb{1}$ is the indicator function.
(2) the average binary cross-entropy loss over column cell selections:
$$J_{\text{cells}} = \frac{1}{|\text{Cells}(col)|} \sum_{c \in \text{Cells}(col)} \text{CE}\left(p_s^{(c)}, \mathbb{1}_{c \in C}\right)$$
where Cells(col) is the set of cells in the chosen column. (3) As no aggregation occurs for cell selection examples, we define the aggregation supervision to be NONE (assigned to $op_0$), and the aggregation loss is:
$$J_{\text{aggr}} = -\log p_a(op_0)$$
The total loss is then $J_{CS} = J_{\text{columns}} + J_{\text{cells}} + \alpha J_{\text{aggr}}$, where $\alpha$ is a scaling hyperparameter.
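The three components of the cell selection loss can be combined as in the sketch below. It follows the textual description (average binary cross-entropy over columns, average binary cross-entropy over cells of the gold column, and a NONE aggregation term scaled by α); the function names, the small `EPS` constant for numerical safety, the 1-based column keys, and the use of index 0 for the empty column are assumptions.

```python
import math
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # (row, column), columns are 1-based
EPS = 1e-12  # avoids log(0); not part of the description above


def binary_ce(p: float, target: float) -> float:
    return -(target * math.log(p + EPS) + (1.0 - target) * math.log(1.0 - p + EPS))


def cell_selection_loss(column_probs: List[float],
                        cell_probs: Dict[Cell, float],
                        gold_cells: Set[Cell],
                        none_prob: float,
                        alpha: float) -> float:
    """J_CS = J_columns + J_cells + alpha * J_aggr for one example."""
    # Gold column: the one holding most gold cells; 0 is the empty column.
    counts: Dict[int, int] = {}
    for _, column in gold_cells:
        counts[column] = counts.get(column, 0) + 1
    gold_column = max(counts, key=counts.get) if counts else 0

    j_columns = sum(binary_ce(p, 1.0 if co == gold_column else 0.0)
                    for co, p in enumerate(column_probs)) / len(column_probs)

    in_column = [cell for cell in cell_probs if cell[1] == gold_column]
    j_cells = (sum(binary_ce(cell_probs[cell], 1.0 if cell in gold_cells else 0.0)
                   for cell in in_column) / len(in_column)) if in_column else 0.0

    j_aggr = -math.log(none_prob + EPS)  # supervision is the NONE operator
    return j_columns + j_cells + alpha * j_aggr
```

Only cells of the gold column contribute to the cell term, which reflects the hierarchical selection of a column followed by cells within it.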
Scalar answer In this case $y$ is a single scalar $s$ which does not appear in the table (i.e. $C = \emptyset$, e.g., question 2 in Figure 3). This usually corresponds to examples that involve an aggregation over one or more table cells, where neither the cells nor the aggregation operator can be inferred directly from the scalar answer $s$. To train the model given this form of supervision one could search offline (Dua et al., 2019; Andor et al., 2019) or online (Berant et al., 2013; Liang et al., 2018) for programs (table cells and aggregation) that execute to $s$. In our table parsing setting, the number of spurious programs that execute to the gold scalar answer can grow quickly with the number of table cells (e.g., when $s = 5$, each COUNT over any five cells is potentially correct). As learning can easily fail with this approach, we avoid it. Instead, we make use of a training recipe where no search for correct programs is needed. Our approach results in an end-to-end differentiable training, similar in spirit to Neelakantan et al. (2015). We implement a fully differentiable layer that latently learns the weights for the aggregation prediction layer $p_a(\cdot)$, without explicit supervision for the aggregation type.
Specifically, we recognize that the result of executing each of the supported aggregation operators is a scalar. We then implement a soft differentiable estimation for each operator (Table 1), given the token selection probabilities and the table values: compute($op$, $p_s$, $T$). Given the results for all aggregation operators, we then calculate the expected result according to the current model:
$$s_{pred} = \sum_{i \geq 1} \hat{p}_a(op_i) \cdot \text{compute}(op_i, p_s, T)$$
where $\hat{p}_a(op_i) = p_a(op_i) / \sum_{j \geq 1} p_a(op_j)$ is a probability distribution normalized over aggregation operators excluding NONE.
Table 1: Soft estimations of the aggregation operators. AVERAGE is estimated as compute(SUM, $p_s$, $T$) / compute(COUNT, $p_s$, $T$).
We then calculate the scalar answer loss with Huber loss (Huber, 1964) given by:
$$J_{\text{scalar}} = \begin{cases} 0.5 \cdot a^2 & a \leq \delta \\ \delta \cdot a - 0.5 \cdot \delta^2 & \text{otherwise} \end{cases}$$
where $a = |s_{pred} - s|$, and $\delta$ is a hyperparameter. Like Neelakantan et al. (2015), we find this loss is more stable than the squared loss. In addition, since a scalar answer implies some aggregation operation, we also define an aggregation loss that penalizes the model for assigning probability mass to the NONE class:
$$J_{\text{aggr}} = -\log\left(\sum_{i \geq 1} p_a(op_i)\right)$$
The total loss is then $J_{SA} = J_{\text{aggr}} + \beta J_{\text{scalar}}$, where $\beta$ is a scaling hyperparameter. As $J_{\text{scalar}}$ can be very large for some examples, which leads to unstable model updates, we introduce a cutoff hyperparameter. Then, for a training example where $J_{\text{scalar}} >$ cutoff, we set $J = 0$ to ignore the example entirely, as we noticed this behaviour correlates with outliers. In addition, as the computation done during training is continuous, while that done during inference is discrete, we further add a temperature that scales token logits such that $p_s$ outputs values closer to binary ones.
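The scalar answer objective can be illustrated with the following sketch. The soft AVERAGE is the ratio of the soft SUM and the soft COUNT; the soft COUNT and SUM are taken here to be the expected count and the expected sum under the independent cell probabilities, which should be read as an assumption of this sketch, as should the function names, the `EPS` constant, and the omission of the temperature.

```python
import math
from typing import Dict, Tuple

Cell = Tuple[int, int]
EPS = 1e-12  # numerical safety only


def soft_operators(cell_probs: Dict[Cell, float],
                   values: Dict[Cell, float]) -> Dict[str, float]:
    """Differentiable estimates of each operator from cell probabilities."""
    count = sum(cell_probs.values())
    total = sum(p * values[cell] for cell, p in cell_probs.items())
    return {"COUNT": count, "SUM": total, "AVERAGE": total / (count + EPS)}


def expected_result(cell_probs: Dict[Cell, float],
                    values: Dict[Cell, float],
                    op_probs: Dict[str, float]) -> float:
    """Expectation over operators, renormalized to exclude NONE."""
    results = soft_operators(cell_probs, values)
    mass = sum(p for op, p in op_probs.items() if op != "NONE")
    return sum(p / (mass + EPS) * results[op]
               for op, p in op_probs.items() if op != "NONE")


def huber(prediction: float, gold: float, delta: float) -> float:
    a = abs(prediction - gold)
    return 0.5 * a * a if a <= delta else delta * a - 0.5 * delta * delta


def scalar_answer_loss(cell_probs: Dict[Cell, float],
                       values: Dict[Cell, float],
                       op_probs: Dict[str, float],
                       gold: float, beta: float,
                       delta: float, cutoff: float) -> float:
    """J_SA = J_aggr + beta * J_scalar, or 0 when J_scalar exceeds the cutoff."""
    j_scalar = huber(expected_result(cell_probs, values, op_probs), gold, delta)
    if j_scalar > cutoff:
        return 0.0  # ignore the example entirely
    mass = sum(p for op, p in op_probs.items() if op != "NONE")
    j_aggr = -math.log(mass + EPS)  # penalize probability mass on NONE
    return j_aggr + beta * j_scalar
```

Because every step is a smooth function of the cell and operator probabilities, gradients of the loss reach both classification layers without any search over programs.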
Ambiguous answer A scalar answer $s$ that also appears in the table (thus $C \neq \emptyset$) is ambiguous, as in some cases the question implies aggregation (question 3 in Figure 3), while in other cases a table cell should be predicted (question 4 in Figure 3). Thus, in this case we dynamically let the model choose the supervision (cell selection or scalar answer) according to its current policy. Concretely, we set the supervision to be of cell selection if $p_a(op_0) \geq S$, where $0 < S < 1$ is a threshold hyperparameter, and the scalar answer supervision otherwise. This follows hard EM (Min et al., 2019), as for spurious programs we pick the most probable one according to the current model.
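The choice among the three training cases can be summarized in a short dispatch function. The name `choose_supervision`, the string labels, and the representation of a missing scalar as `None` are hypothetical and serve only to illustrate the rule.

```python
from typing import Optional, Set, Tuple

Cell = Tuple[int, int]


def choose_supervision(cells: Set[Cell], scalar: Optional[float],
                       none_prob: float, threshold: float) -> str:
    """Select the loss for one example from (C, s) and the current p_a(op_0)."""
    if scalar is None:
        return "cell_selection"   # includes the empty selection case
    if not cells:
        return "scalar_answer"    # s is populated but C is empty
    # Ambiguous: s also appears in the table, so let the model decide.
    return "cell_selection" if none_prob >= threshold else "scalar_answer"
```

For ambiguous examples the decision can change during training as the model's probability for NONE moves across the threshold.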

Datasets
We experiment with the following semantic parsing datasets that reason over single tables (see Table 2).
WIKITQ (Pasupat and Liang, 2015) This dataset consists of complex questions on Wikipedia tables. Crowd workers were asked, given a table, to compose a series of complex questions that include comparisons, superlatives, aggregation or arithmetic operation. The questions were then verified by other crowd workers.
SQA (Iyyer et al., 2017) This dataset was constructed by asking crowd workers to decompose a subset of highly compositional questions from WIKITQ, where each resulting decomposed question can be answered by one or more table cells.
The final set consists of 6,066 question sequences (2.9 questions per sequence on average).
WIKISQL (Zhong et al., 2017) This dataset focuses on translating text to SQL. It was constructed by asking crowd workers to paraphrase a template-based question in natural language. Two other crowd workers were asked to verify the quality of the proposed paraphrases.
As our model predicts cell selection or scalar answers, we convert the denotations for each dataset to (question, cell coordinates, scalar answer) triples. SQA already provides this information (gold cells for each question). For WIKISQL and WIKITQ, we only use the denotations. Therefore, we derive cell coordinates by matching the denotations against the table contents. We fill scalar answer information if the denotation contains a single element that can be interpreted as a float, otherwise we set its value to NaN. We drop examples if there is no scalar answer and the denotation cannot be found in the table, or if some denotation matches multiple cells.
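The conversion from denotations to cell coordinates and a scalar can be sketched as below. The function name `to_supervision`, the case-insensitive exact string matching, and the reading of "cannot be found" as "some element has no matching cell" are assumptions; the matching used in practice may be more lenient.

```python
import math
from typing import List, Optional, Set, Tuple

Cell = Tuple[int, int]  # (row index, column index), 0-based here


def normalize(text: str) -> str:
    return text.strip().lower()


def to_supervision(denotation: List[str],
                   table: List[List[str]]) -> Optional[Tuple[Set[Cell], float]]:
    """Map a denotation to (cell coordinates, scalar), or None to drop it."""
    cells: Set[Cell] = set()
    all_found = True
    for answer in denotation:
        matches = [(r, c) for r, row in enumerate(table)
                   for c, value in enumerate(row)
                   if normalize(value) == normalize(answer)]
        if len(matches) > 1:
            return None           # an element matches multiple cells
        if not matches:
            all_found = False
        cells.update(matches)

    scalar = math.nan             # NaN marks "no scalar answer"
    if len(denotation) == 1:
        try:
            scalar = float(denotation[0])
        except ValueError:
            pass

    if math.isnan(scalar) and not all_found:
        return None               # neither a scalar nor a matching cell
    return cells, scalar
```

An answer such as "5" that parses as a float and also matches a single cell yields both a cell and a scalar, which corresponds to the ambiguous case during fine-tuning.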

Experimental Setup
We apply the standard BERT tokenizer on questions, table cells and headers, using the same vocabulary of 32k word pieces. Numbers and dates are parsed in a similar way as in the Neural Programmer (Neelakantan et al., 2017).
The official evaluation script of WIKITQ and SQA is used to report the denotation accuracy for these datasets. For WIKISQL, we generate the reference answer, aggregation operator and cell coordinates from the reference SQL provided using our own SQL implementation running on the JSON tables. However, we find that the answer produced by the official WIKISQL evaluation script is incorrect for approx. 2% of the examples. Throughout this paper we report accuracies against our reference answers, but we explain the differences and also provide accuracies compared to the official reference answers in Appendix A.
We start pre-training from BERT-Large (see Appendix B for hyper-parameters). We find it beneficial to start the pre-training from a pre-trained standard text BERT model (while randomly initializing our additional embeddings), as this enhances convergence on the held-out set.
We run both pre-training and fine-tuning on a setup of 32 Cloud TPU v3 cores with maximum sequence length 512. In this setup pre-training takes around 3 days and fine-tuning around 10 hours for WIKISQL and WIKITQ and 20 hours for SQA (with the batch sizes from Table 12). The resource requirements of our model are essentially the same as BERT-large.
For fine-tuning, we choose hyper-parameters using a black box Bayesian optimizer similar to Google Vizier (Golovin et al., 2017) for WIKISQL and WIKITQ. For SQA we use grid-search. We discuss the details in Appendix B.

Results
All results report the denotation accuracy for models trained from weak supervision. We follow Niven and Kao (2019) and report the median for 5 independent runs, as BERT-based models can degenerate. We present our results for WIKISQL and WIKITQ in Tables 3 and 4 respectively. Table 3 shows that TAPAS, trained in the weakly supervised setting, achieves close to state-of-the-art performance for WIKISQL (83.6 vs 83.9 (Min et al., 2019)). If given the gold aggregation operators and selected cells as supervision (extracted from the reference SQL), which amounts to full supervision for TAPAS, the model achieves 86.4. Unlike the full SQL queries, this supervision can be annotated by non-experts. For WIKITQ the model trained only from the original training data reaches 42.6, which surpasses similar approaches (Neelakantan et al., 2015). When we pre-train the model on WIKISQL or SQA (which is straightforward in our setup, as we do not rely on a logical formalism), TAPAS achieves 48.7 and 48.8, respectively.
For SQA, Table 5 shows that TAPAS leads to substantial improvements on all metrics: it improves all metrics by at least 11 points, sequence accuracy from 28.1 to 40.4 and average question accuracy from 55.1 to 67.2.

Model ablations
Table 6 shows an ablation study on our different embeddings. To this end we pre-train and fine-tune models with different features. As pre-training is expensive we limit it to 200,000 steps. For all datasets we see that pre-training on tables and the column and row embeddings are the most important. Positional and rank embeddings also improve quality, but to a lesser extent.
We additionally find that when removing the scalar answer and aggregation losses (i.e., setting $J_{SA} = 0$) from TAPAS, accuracy drops for both datasets. For WIKISQL performance drops from 84.7 to 82.6. The relatively small decrease for WIKISQL can be explained by the fact that most examples do not need aggregation to be answered. In principle, 17% of the examples of the dev set have an aggregation (SUM, AVERAGE or COUNT); however, for all types we find that for more than 98% of the examples the aggregation is only applied to one or no cells. In the case of SUM and AVERAGE, this means that most examples can be answered by selecting one or no cells from the table. For COUNT the model without aggregation operators achieves 28.2 accuracy (by selecting 0 or 1 from the table) vs. 66.5 for the model with aggregation. Note that 0 and 1 are often found in a special index column. These properties of WIKISQL make it challenging for the model to decide whether to apply aggregation or not. For WIKITQ on the other hand, we observe a substantial drop in performance from 29.0 to 23.1 when removing aggregation.
Qualitative Analysis on WIKITQ We manually analyze 200 dev set predictions made by TAPAS on WIKITQ. For correct predictions via an aggregation, we inspect the selected cells to see if they match the ground truth. We find that 96% of the correct aggregation predictions were also correct in terms of the cells selected. We further find that 14% of the correct aggregation predictions had only one cell, and could potentially be achieved by cell selection, with no aggregation.
We also perform an error analysis and identify the following exclusive salient phenomena: (i) 12% are ambiguous ("Name at least two labels that released the group's albums."), have wrong labels or missing information; (ii) 10% of the cases require complex temporal comparisons which could also not be parsed with a rich formalism such as SQL ("what country had the most cities founded in the 1830's?"); (iii) in 16% of the cases the gold denotation has a textual value that does not appear in the table, thus it could not be predicted without performing string operations over cell values; (iv) in 10% of the cases, the table is too big to fit in 512 tokens; (v) in 13% of the cases TAPAS selected no cells, which suggests introducing penalties for this behaviour; (vi) in 2% of the cases, the answer is the difference between scalars, so it is outside of the model capabilities ("how long did anne churchill/spencer live?"); (vii) the other 37% of the cases could not be classified to a particular phenomenon.
Pre-training Analysis In order to understand what TAPAS learns during pre-training we analyze its performance on 10,000 held-out examples. We split the data such that the tables in the held-out data do not occur in the training data. Table 7 shows the accuracy of masked word pieces of different types and in different locations. We find that average accuracy across positions is relatively high (71.4). Predicting tokens in the header of the table is easiest (96.6), probably because many Wikipedia articles use instances of the same kind of table. Predicting word pieces in cells is a bit harder (63.4) than predicting pieces in the text (68.8). The biggest differences can be observed when comparing predicting words (74.1) and numbers (53.9). This is expected since numbers are very specific and often hard to generalize. The soft-accuracy metric and example (Appendix C) demonstrate, however, that the model is relatively good at predicting numbers that are at least close to the target.
Limitations TAPAS handles single tables as context, which must be able to fit in memory. Thus, our model would fail to capture very large tables, or databases that contain multiple tables. In this case, the table(s) could be compressed or filtered, such that only relevant content would be encoded, which we leave for future work. In addition, although TAPAS can parse compositional structures (e.g., question 2 in Figure 3), its expressivity is limited to a form of an aggregation over a subset of table cells. Thus, structures with multiple aggregations such as "number of actors with an average rating higher than 4" could not be handled correctly. Despite this limitation, TAPAS succeeds in parsing three different datasets, and we did not encounter this kind of error in Section 5.3. This suggests that the majority of examples in semantic parsing datasets are limited in their compositionality.

Related Work
Semantic parsing models are mostly trained to produce gold logical forms using an encoder-decoder approach (Jia and Liang, 2016; Dong and Lapata, 2016). To reduce the burden in collecting full logical forms, models are typically trained from weak supervision in the form of denotations. These are used to guide the search for correct logical forms (Clarke et al., 2010; Liang et al., 2011).
Other works suggested end-to-end differentiable models that train from weak supervision, but do not explicitly generate logical forms. Neelakantan et al. (2015) proposed a complex model that sequentially predicts symbolic operations over table segments that are all explicitly predefined by the authors, while Yin et al. (2016) proposed a similar model where the operations themselves are learned during training. Müller et al. (2019) proposed a model that selects table cells, where the table and question are represented as a Graph Neural Network; however, their model cannot predict aggregations over table cells. Cho et al. (2018) proposed a supervised model that predicts the relevant rows, column and aggregation operation sequentially. In our work, we propose a model that follows this line of work, with a simpler architecture than past models (as the model is a single encoder that performs computation for many operations implicitly) and more coverage (as we support aggregation operators over selected cells).
Finally, pre-training methods have been designed with different training objectives, including language modeling (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018) and masked language modeling (Devlin et al., 2019; Lample and Conneau, 2019). These methods dramatically boost the performance of natural language understanding models (Peters et al., 2018, inter alia).

Conclusion
In this paper we presented TAPAS, a model for question answering over tables that avoids generating logical forms. We showed that TAPAS effectively pre-trains over large scale data of text-table pairs and successfully restores masked words and table cells. We additionally showed that the model can fine-tune on semantic parsing datasets, only using weak supervision, with an end-to-end differentiable recipe. Results show that TAPAS achieves better or competitive results in comparison to state-of-the-art semantic parsers.
In future work we aim to extend the model to represent a database with multiple tables as context, and to effectively handle large tables.

A WIKISQL Execution Errors
In some tables, WIKISQL contains "REAL" numbers stored in "TEXT" format. This leads to incorrect results for some of the comparison and aggregation examples. These errors in the WIKISQL execution accuracy penalize systems that do their own execution (rather than producing an SQL query). Table 8 shows two examples where our result derivation and the one used by WIKISQL differ because the numbers in the "Crowd" (col5) column are not represented as numbers in the respective SQL table. Tables 9 and 10 contain accuracies compared against the official and our answers.

C Pre-training Example
In order to better understand how well the model predicts numbers, we relax our accuracy measure to a soft form of accuracy:
$$acc(x, y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{if } x \text{ or } y \text{ is not a number} \\ 1.0 - \frac{|x - y|}{\max(x, y)} & \text{else} \end{cases}$$
With this soft metric we get an overall accuracy of 74.5 (instead of 71.4) and an accuracy of 80.5 (instead of 53.9) for numbers, showing that the model is pretty good at guessing numbers that are at least close to the target. The following example demonstrates this:
In the example, the model correctly restores the Draw (D) and Loss (L) numbers for Spain. It fails to restore the Points For (PF) and Points Against (PA) for Zimbabwe, but gives close estimates. Note that the model also does not produce completely consistent results: for each row we should have PA + PD = PF, and the column sums of PF and PA should be equal.
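A small sketch of the soft accuracy for a single predicted token is given below. The function name and the string-based interface are hypothetical, and the guard for a non-positive maximum is an added assumption, since the relaxed measure is only meaningful for positive numbers.

```python
def soft_accuracy(prediction: str, target: str) -> float:
    """Exact match scores 1; numeric near-misses receive partial credit."""
    if prediction == target:
        return 1.0
    try:
        x, y = float(prediction), float(target)
    except ValueError:
        return 0.0  # at least one of the two is not a number
    largest = max(x, y)
    if largest <= 0:
        return 0.0  # assumption: no partial credit outside positive numbers
    return 1.0 - abs(x - y) / largest
```

A prediction of 8 for a target of 10 therefore scores 1 − 2/10, while any mismatch involving a non-numeric token scores 0, as under the strict measure.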

D The average of stochastic sets
Our approach to estimate aggregates of cells in the table operates directly on latent conditionally independent Bernoulli variables $G_c \sim \text{Bern}(p_c)$ that indicate whether each cell is included in the aggregation and a latent categorical variable that indicates the chosen aggregation operation op: AVERAGE,
Table 8: Table "2-10767641-15" from WIKISQL. "col6" was removed. The "Crowd" column is of type "REAL" but the cell values are actually stored as "TEXT". Below we have two questions from the training set with the answer that is produced by the WIKISQL evaluation script and the answer we derive.