Understanding tables with intermediate pre-training

Table entailment, the binary classification task of determining whether a sentence is supported or refuted by the content of a table, requires parsing language and table structure as well as numerical and discrete reasoning. While there is extensive work on textual entailment, table entailment is less well studied. We adapt TAPAS (Herzig et al., 2020), a table-based BERT model, to recognize entailment. Motivated by the benefits of data augmentation, we create a balanced dataset of millions of automatically created training examples which are learned in an intermediate step prior to fine-tuning. This new data is not only useful for table entailment, but also for SQA (Iyyer et al., 2017), a sequential table QA task. To be able to use long examples as input to BERT models, we evaluate table pruning techniques as a pre-processing step that drastically improves training and prediction efficiency at a moderate drop in accuracy. The different methods set the new state-of-the-art on the TabFact (Chen et al., 2020) and SQA datasets.


Introduction
Textual entailment (Dagan et al., 2005), also known as natural language inference (Bowman et al., 2015), is a core natural language processing (NLP) task. It is predictive of reading comprehension effectiveness (Dagan et al., 2010), who argue that it can form the foundation of many other NLP tasks, and it is a useful neural pre-training task (Subramanian et al., 2018; Conneau et al., 2017).
Textual entailment is well studied, but many relevant data sources are structured or semi-structured: health data both worldwide and personal, fitness trackers, stock markets, and sport statistics. While some information needs can be anticipated by handcrafted templates, user queries are often surprising, and having models that can reason and parse that structure can have a great impact in real world applications (Khashabi et al., 2016;Clark, 2019).
A recent example is TABFACT (Chen et al., 2020), a dataset of statements that are either entailed or refuted by tables from Wikipedia (Figure 1). Because solving these entailment problems requires sophisticated reasoning and higher-order operations like arg max, averaging, or comparing, human accuracy remains substantially (18 points) ahead of the best models (Zhong et al., 2020).
The current models are dominated by semantic parsing approaches that attempt to create logical forms from weak supervision. We, on the other hand, follow Herzig et al. (2020) and encode the tables with BERT-based models to directly predict the entailment decision. But while BERT models for text have been scrutinized and optimized for how to best pre-train and represent textual data, the same attention has not been applied to tabular data, limiting the effectiveness in this setting. This paper addresses these shortcomings using intermediate task pre-training (Pruksachatkun et al., 2020), creating efficient data representations, and applying these improvements to the tabular entailment task.
Our methods are tested on the English language, mainly due to the availability of the end task resources. However, we believe that the proposed solutions could be applied in other languages where a pre-training corpus of text and tables is available, such as the Wikipedia datasets.
Our main contributions are the following: i) We introduce two intermediate pre-training tasks, which are learned starting from a trained MASK-LM model, one based on synthetic and the other on counterfactual statements. The first one generates a sentence by sampling from a set of logical expressions that filter, combine and compare the information in the table, which is required in table entailment (e.g., knowing that Gerald Ford is taller than the average president requires summing the heights of all presidents and dividing by the number of presidents). The second one corrupts sentences about tables appearing on Wikipedia by swapping entities for plausible alternatives. Examples of the two tasks can be seen in Figure 1. The procedure is described in detail in Section 3.
ii) We demonstrate column pruning to be an effective means of lowering computational cost at minor drops in accuracy, doubling the inference speed at the cost of less than one accuracy point.
iii) Using the pre-training tasks, we set the new state-of-the-art on TABFACT, out-performing previous models by 6 points when using a BERT-Base model and 9 points for a BERT-Large model. The procedure is data efficient and achieves accuracies comparable to previous approaches using only 10% of the data. We perform a detailed analysis of the improvements in Section 6. Finally, we show that our method improves the state-of-the-art on a question answering task (SQA) by 4 points.
We release the pre-training checkpoints, data generation and training code at github.com/google-research/tapas.

Model
We use a model architecture derived from BERT and add additional embeddings to encode the table structure, following the approach of Herzig et al. (2020) to encode the input.
The statement and table in a pair are tokenized into word pieces and concatenated using the standard [CLS] and [SEP] tokens in between. The table is flattened row by row and no additional separator is added between the cells or rows. Six types of learnable input embeddings are added together, as shown in Appendix B. Token embeddings, position embeddings and segment embeddings are analogous to the ones used in standard BERT. Additionally, we follow Herzig et al. (2020) and use column and row embeddings, which encode the two-dimensional position of the cell that a token corresponds to, as well as rank embeddings for numeric columns, which encode the numeric rank of the cell with respect to its column and provide a simple way for the model to know how a row is ranked according to a specific column.
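To make the extra embedding indices concrete, here is a small sketch (not the released TAPAS code; the function name and id conventions are ours) of how column, row and rank ids can be assigned to a table flattened row by row. Cells get 1-based column/row ids, and rank ids order the numeric values within each column, with 0 for non-numeric cells:

```python
# Sketch of the extra embedding indices for a flattened table.
# Each cell receives (column_id, row_id, rank_id); rank_id is 0 for
# non-numeric cells and 1 for the smallest value in a numeric column.

def table_index_ids(table):
    """table: list of data rows, each a list of cell strings.
    Returns a list of (column_id, row_id, rank_id) tuples, one per
    cell, in row-by-row flattening order."""
    n_rows, n_cols = len(table), len(table[0])
    ranks = [[0] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        numeric = []
        for r in range(n_rows):
            try:
                numeric.append((float(table[r][c]), r))
            except ValueError:
                pass  # non-numeric cell keeps rank 0
        for rank, (_, r) in enumerate(sorted(numeric), start=1):
            ranks[r][c] = rank
    ids = []
    for r in range(n_rows):
        for c in range(n_cols):
            ids.append((c + 1, r + 1, ranks[r][c]))
    return ids
```

In a real model, each of these ids would index a learnable embedding table and the resulting vectors would be summed with the token, position and segment embeddings.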
Recall that the bi-directional self-attention mechanism in transformers is unaware of order, which motivates the usage of positional and segment embeddings for text in BERT and generalizes naturally to column and row embeddings for the two-dimensional case of tables.
Let s and T represent the statement and table respectively, and E_s and E_T their corresponding input embeddings. The concatenated sequence of embeddings is passed through a transformer (Vaswani et al., 2017), denoted f, and a contextual representation is obtained for every token. We model the probability of entailment P(s|T) with a single hidden layer neural network computed from the output of the [CLS] token:

P(s|T) = sigmoid(W_2 tanh(W_1 f_[CLS](s, T) + b_1) + b_2)

where the middle layer has the same size as the hidden dimension and uses a tanh activation and the final layer uses a sigmoid activation.
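A minimal numeric sketch of this classification head may clarify the shape of the computation; the weights below are placeholder values, not trained parameters, and the real model operates on the full hidden dimension:

```python
import math

# Toy version of the entailment head: a tanh hidden layer over the
# [CLS] vector, then a sigmoid producing the entailment probability.

def entailment_probability(cls_vec, w1, b1, w2, b2):
    """cls_vec: the [CLS] contextual representation (list of floats).
    w1: hidden-layer weight rows, b1: hidden biases,
    w2: output weights, b2: output bias."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, cls_vec)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))
```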

Methods
The use of challenging pre-training tasks has been successful in improving downstream accuracy (Clark et al., 2020). One clear caveat of the method adopted in Herzig et al. (2020), which attempts to fill in the blanks of sentences and cells in the table, is that not much understanding of the table in relation to the sentence is needed.
With that in mind, we propose two tasks that require sentence-table reasoning and feature complex operations performed on the table and entities grounded in sentences in non-trivial forms.
We discuss two methods to create pre-training data that lead to stronger table entailment models. Both methods create statements for existing Wikipedia tables. We extract all tables that have at least two columns, a header row and two data rows. We recursively split tables row-wise into the upper and lower half until they have at most 50 cells. This way we obtain 3.7 million tables.

Counterfactual Statements
Motivated by work on counterfactually-augmented data (Kaushik et al., 2020; Gardner et al., 2020), we propose an automated and scalable method to obtain table entailments from Wikipedia and, for each such positive example, create a minimally differing refuted example. For this pair to be useful, we require that their truth value can be predicted from the associated table but not without it.
The tables and sentences are extracted from Wikipedia as follows: we use the page title, description, section title, text and caption. We also use all sentences on Wikipedia that link to the table's page and mention at least one page (entity) that is also mentioned in the table. These snippets are then split into sentences using the NLTK (Loper and Bird, 2002) implementation of Punkt (Kiss and Strunk, 2006). For each relevant sentence we create one positive and one negative statement.
Consider the table in Figure 1 and the sentence '[Greg Norman] is [Australian].' (square brackets indicate mention boundaries). A mention is a potential focus mention if the same entity or value is also mentioned in the table. In our example, Greg Norman and Australian are potential focus mentions. Given a focus mention (Greg Norman), we define all the mentions that occur in the same column (but do not refer to the same entity) as the replacement mentions (e.g., Billy Mayfair, Lee Janzen, . . . ). We expect to create a false statement if we replace the focus mention with a replacement mention (e.g., 'Billy Mayfair is Australian.'), but there is no guarantee it will actually be false.
We call a mention of an entity that occurs in the same row as the focus entity a supporting mention, because it increases the chance that we falsify the statement by replacing the focus entity. In our example, Australian would be a supporting mention for Greg Norman (and vice versa). If we find a supporting mention we restrict the replacement candidates to the ones that have a different value.
In the example, we would not use Steve Elkington since his row also refers to Australia.
Some replacements can lead to ungrammatical statements that a model could use to identify the negative statements, so we found it useful to also replace the entity in the original positive sentence from Wikipedia with the mention from the table. We also introduce a simple type system for entities (named entity, date, cardinal number and ordinal number) and only replace entities of the same type. Short sentences with fewer than 4 tokens (not counting the mention) are filtered out.
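The corruption step described above can be sketched as follows. This is a hypothetical simplification (function name, arguments and the string-replacement mechanics are ours): given a focus mention and its column, it swaps in other mentions from that column, and when a supporting column is known it excludes candidates whose row shares the supporting value:

```python
# Sketch of the counterfactual corruption step: replace the focus
# mention with another mention from the same column; if the focus row
# has a supporting mention, skip candidates whose row carries the
# same supporting value (they might keep the statement true).

def corrupt(sentence, focus, table, focus_col, support_col=None):
    """table: list of rows (lists of cell strings).
    Returns the list of candidate negative statements."""
    focus_rows = [r for r in table if r[focus_col] == focus]
    support_vals = (
        {r[support_col] for r in focus_rows}
        if support_col is not None else set()
    )
    negatives = []
    for row in table:
        cand = row[focus_col]
        if cand == focus:
            continue  # same entity, statement would stay true
        if support_col is not None and row[support_col] in support_vals:
            continue  # e.g. skip Steve Elkington, also Australian
        negatives.append(sentence.replace(focus, cand))
    return negatives
```

On the Figure 1 example, this keeps Billy Mayfair and Lee Janzen as replacements for Greg Norman while excluding Steve Elkington, whose row also refers to Australia.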
Using this approach we extract 4.1 million counterfactual pairs of which 546 thousand do have a supporting mention and the remaining do not.
We evaluated 100 random examples manually and found that the percentage of negative statements that are false and can be refuted by the table is 82% when they have a supporting mention and 22% otherwise. Despite this low value we still found the examples without supporting mention to improve accuracy on the end tasks (Appendix F).

Synthetic Statements
Motivated by previous work (Geva et al., 2020), we propose a synthetic data generation method to improve the handling of numerical operations and comparisons. We build a table-dependent statement that compares two simplified SQL-like expressions. We define the (probabilistic) context-free grammar shown in Figure 2. Synthetic statements are sampled from the CFG. We constrain the select values of the left and right expression to be either both the count or to have the same value for column. This guarantees that the domains of both expressions are comparable. value is chosen at random from the respective column. A statement is redrawn if it yields an error (see Table 1). With probability 0.5 we replace one of the two expressions by the value it evaluates to. Figure 1 shows an example of a resulting synthetic statement.

Table 1: Aggregation operations over a column C.
first — the value in C with the lowest row index.
last — the value in C with the highest row index.
greatest — the value in C with the highest numeric value.
lowest — the value in C with the lowest numeric value.
sum — the sum of all the numeric values.
average — the average of all the numeric values.
range — the difference between greatest and lowest.
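A toy sampler in the spirit of this task may help; the full grammar of Figure 2 is much richer, and the statement template, function names and labeling logic below are illustrative assumptions, not the released generation code:

```python
import random

# Toy synthetic-statement sampler: compare two aggregations over the
# same numeric column and label the statement by evaluating both sides.

OPS = {
    "first":    lambda v: v[0],
    "last":     lambda v: v[-1],
    "greatest": lambda v: max(v),
    "lowest":   lambda v: min(v),
    "sum":      sum,
    "average":  lambda v: sum(v) / len(v),
    "range":    lambda v: max(v) - min(v),
}

def sample_statement(column_name, values, rng=random):
    """values: numeric values of one column. Returns (text, label)."""
    op_l, op_r = rng.sample(sorted(OPS), 2)
    rel = rng.choice(["greater", "less"])
    left, right = OPS[op_l](values), OPS[op_r](values)
    label = left > right if rel == "greater" else left < right
    text = (f"the {op_l} of {column_name} is {rel} "
            f"than the {op_r} of {column_name}")
    return text, label
```

Because the label is obtained by executing both expressions, the generated data is balanced by construction once statements with errors (e.g., empty columns) are redrawn.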

Table pruning
Some input examples from TABFACT can be too long for BERT-based models. We evaluate table pruning techniques as a pre-processing step to select relevant columns that respect the input example length limits. As described in section 2, an example is built by concatenating the statement with the flattened table. For large tables the example length can exceed the capacity limit of the transformer.
The TAPAS model handles this by shrinking the text in cells. A token selection algorithm loops over the cells. For each cell it starts by selecting the first token, then the second and so on until the maximal length is reached. Unless stated otherwise we use the same approach. Crucially, selecting only relevant columns would allow longer examples to fit without discarding potentially relevant tokens.
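The token selection loop described above can be sketched as follows; this is a simplified reading of the description (the released implementation may differ in details such as how the budget accounts for the statement and special tokens):

```python
# Round-robin cell-token selection: take the first token of every
# cell, then the second of every cell, and so on, until the token
# budget is exhausted.

def select_cell_tokens(cells, max_tokens):
    """cells: list of token lists, one per cell.
    Returns the kept tokens per cell, preserving within-cell order."""
    kept = [[] for _ in cells]
    budget = max_tokens
    depth = 0
    while budget > 0:
        took_any = False
        for i, cell in enumerate(cells):
            if depth < len(cell) and budget > 0:
                kept[i].append(cell[depth])
                budget -= 1
                took_any = True
        if not took_any:
            break  # every cell is exhausted
        depth += 1
    return kept
```

This favors breadth over depth: every cell contributes its leading tokens before any cell contributes deeper ones.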
Heuristic entity linking (HEL) is used as a baseline. It is the table pruning method used in TABLE-BERT. The algorithm aligns spans in the statement to columns by extracting the longest character n-gram that matches a cell. The span matches represent linked entities. Each entity in the statement can be linked to only one column. We use the provided entity linking statements data. We then run the TAPAS token selection algorithm on top of the pruned input to limit the input size.
We propose a different method that tries to retain as many columns as possible. In our method, the columns are ranked by a relevance score and added in order of decreasing relevance. Columns that would exceed the maximum input length are skipped. The algorithm is detailed in Appendix D. Heuristic exact match (HEM) computes the Jaccard coefficient between the statement and each column. Let T_S be the set of tokens in the statement S and T_C the tokens in column C, with C ∈ C the set of columns. The score between the statement and a column is then given by |T_S ∩ T_C| / |T_S ∪ T_C|. We also experimented with approaches based on word2vec (Mikolov et al., 2013), character overlap and TF-IDF. Generally, they produced worse results than HEM. Details are shown in Appendix D.
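The HEM score is straightforward to implement; the sketch below uses plain whitespace tokenization and lowercasing as assumptions (the paper's exact tokenization is not specified here):

```python
# HEM: Jaccard coefficient between the statement token set and each
# column's token set (header plus cell contents).

def hem_scores(statement, columns):
    """columns: dict mapping column name to a list of cell strings.
    Returns a dict of column name -> Jaccard score."""
    t_s = set(statement.lower().split())
    scores = {}
    for name, cells in columns.items():
        t_c = set(name.lower().split())
        for cell in cells:
            t_c.update(cell.lower().split())
        scores[name] = len(t_s & t_c) / len(t_s | t_c)
    return scores
```

Columns can then be sorted by this score and added greedily until the input budget is exhausted.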

Experimental Setup
In all experiments, we start with the public TAPAS checkpoint, 6 train an entailment model on the data from Section 3 and then fine-tune on the end task (TABFACT or SQA). We report the median accuracy values over 3 pre-training and 3 fine-tuning runs (9 runs in total). We estimate the error margin as half the interquartile range, that is half the difference between the 25th and 75th percentiles. The hyper-parameters, how we chose them, hardware and other information to reproduce our experiments are explained in detail in Appendix A.
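The aggregation of run results (median, with the error margin as half the interquartile range) can be computed with the standard library:

```python
import statistics

# Median accuracy over runs, with the error margin taken as half the
# interquartile range (half the gap between the 25th and 75th
# percentiles).

def median_and_margin(accuracies):
    accs = sorted(accuracies)
    median = statistics.median(accs)
    q1, _, q3 = statistics.quantiles(accs, n=4)
    return median, (q3 - q1) / 2
```

Note that `statistics.quantiles` defaults to the "exclusive" method; with 9 runs this interpolates the quartiles between adjacent order statistics.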
The training time depends on the sequence length used. For a BERT-Base model it takes around 78 minutes using 128 tokens and it scales almost linearly up to 512. For our pre-training tasks, we explore multiple lengths and how they trade-off speed for downstream results.

Datasets
We evaluate our model on the recently released TABFACT dataset. The tables are extracted from Wikipedia and the sentences were written by crowd workers in two batches. The first batch consisted of simple sentences, for which writers were instructed to refer to a single row in the table. The second batch created complex sentences by asking writers to use information from multiple rows.
In both cases, crowd workers initially created only positive (entailed) pairs, and in a subsequent annotation job, the sentences were copied and edited into negative ones, with instructions to avoid simple negations. Finally, there was a third verification step to filter out bad rewrites. The final count is 118,000 statements. The split sizes are given in Appendix C. An example of a table and the sentences is shown in Figure 1. We use the standard TABFACT split and the official accuracy metric.
We also use the SQA (Iyyer et al., 2017) dataset for pre-training (following Herzig et al. (2020)) and for testing whether our pre-training is useful for related tasks. SQA is a question answering dataset that was created by asking crowd workers to split a compositional subset of WikiTableQuestions (Pasupat and Liang, 2015) into multiple referential questions. The dataset consists of 6,066 sequences (2.9 questions per sequence on average). We use the standard split and official evaluation script.

Baselines
Chen et al. (2020) present two models, TABLE-BERT and the Latent Program Algorithm (LPA), that yield similar accuracy on the TABFACT data.
LPA tries to predict a latent program that is then executed to verify if the statement is correct or false. The search over programs is restricted using lexical heuristics. Each program and sentence is encoded with an independent transformer model and then a linear layer gives a relevance score to the pair. The model is trained with weak supervision where programs that give the correct binary answer are considered positive and the rest negative.
TABLE-BERT is a BERT-base model that, similar to our approach, directly predicts the truth value of the statement. However, the model does not use special embeddings to encode the table structure but relies on a template approach to format the table as natural language. The table is mapped into a single sequence of the form: "Row 1 Rank is 1; the Player is Greg Norman; ... . Row 2 ...". The model is also not pre-trained on table data.
LOGICALFACTCHECKER (Zhong et al., 2020) is another transformer-based model that, given a candidate logical expression, combines contextual embeddings of the program, sentence and table with a tree-RNN (Socher et al., 2013) to encode the parse tree of the expression. The programs are obtained through either LPA or an LSTM generator (Seq2Action).

Results
TABFACT In Table 2 we find that our approach outperforms the previous state-of-the-art on TABFACT by more than 6 points (Base) or more than 9 points (Large). A model initialized only with the public TAPAS MASK-LM checkpoint is behind the state-of-the-art by 2 points (71.7% vs. 69.9%). If we train on the counterfactual data, the model out-performs LOGICALFACTCHECKER and reaches 75.2% test accuracy (+5.3), slightly above using SQA. Only using the synthetic data is better (77.9%), and using both datasets achieves 78.5%. Switching from BERT-Base to BERT-Large improves the accuracy by another 2.5 points. The improvements are consistent across all test sets.

Zero-Shot Accuracy and low resource regimes
The pre-trained models are in principle already complete table entailment predictors. Therefore it is interesting to look at their accuracy on the TABFACT evaluation set before fine-tuning them. We find that the best model trained on all the pretraining data is only two points behind the fully trained TABLE-BERT (63.8% vs 66.1%). This relatively good accuracy mostly stems from the counterfactual data.
When looking at low data regimes in Figure 3 we find that pre-training on SQA or our artificial data consistently leads to better results than just training with the MASK-LM objective. The models with synthetic pre-training data start outperforming TABLE-BERT when using only 5% of the training set. The setup with all the data is consistently better than the others, and synthetic and counterfactual are both better than SQA.

Table 2: Test accuracy on TABFACT, compared to Chen et al. (2020) and Zhong et al. (2020). The best BERT-base model, while comparable in parameters, out-performs TABLE-BERT by more than 12 points. Pre-training with counterfactual and synthetic data gives an accuracy 8 points higher than only using MASK-LM and more than 3 points higher than using SQA. Both counterfactual and synthetic data out-perform pre-training with a MASK-LM objective and SQA. Joining the two datasets gives an additional improvement. Error margins are estimated as half the interquartile range.
SQA Our pre-training data also improves the accuracy on a QA task. On SQA (Iyyer et al., 2017) a model pre-trained on the synthetic entailment data outperforms one pre-trained on the MASK-LM task alone (Table 3).

Table 3: SQA test results. ALL is the average question accuracy and SEQ the sequence accuracy. Both counterfactual and synthetic data out-perform the MASK-LM objective. Our Large model outperforms the MASK-LM model by almost 4 points on both metrics. Our best Base model is comparable to the previous state-of-the-art. Error margins are estimated as half the interquartile range.
Efficiency As discussed in Section 3.3 and Appendix A.4, we can increase the model efficiency by reducing the input length. By pruning the input of the TABFACT data we can improve training as well as inference time. We compare pruning with heuristic entity linking (HEL) and heuristic exact match (HEM) at different target lengths. We also studied other pruning methods; the results are reported in Appendix D. In Table 4 we find that HEM consistently outperforms HEL. The best model at length 256, while twice as fast to train (and apply), is only 0.8 points behind the best full-length model. Even the model with length 128, while using a much shorter length, out-performs TABLE-BERT by more than 7 points. Given a pre-trained MASK-LM model, our training consists of training on the artificial pre-training data and then fine-tuning on TABFACT. We can therefore improve the training time by pre-training with shorter input sizes. Table 4 shows that 512 and 256 give similar accuracy while the results for 128 are about 1 point lower.

Analysis
Salient Groups To obtain detailed information on the improvements of our approach, we manually annotated 200 random examples with the complex operations needed to answer them. We found 4 salient groups: aggregations, superlatives, comparatives and negations, and sort pairs into these groups via keywords in the text. To make the groups exclusive, we add a fifth case when more than one operation is needed. The accuracy of the heuristics was validated through further manual inspection of 50 samples per group. The trigger words of each group are described in Appendix G.
For each group within the validation set, we look at the difference in accuracy between different models. We also look at how the total error rate is divided among the groups as a way to guide the focus on pre-training tasks and modeling. The error rate defined in this way measures the potential accuracy gain if all the errors in a group S were fixed: ER(S) = |{errors in S}| / |{validation examples}|. Among the groups, the intermediate task data improve superlatives (39% error reduction) and negations (31%) most (Table 5). For example, we see that the accuracy is higher for superlatives than for the overall validation set.
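The grouping and error-rate computation can be sketched as follows; the trigger words below are illustrative stand-ins (the actual lists are in Appendix G), and the function names are ours:

```python
# Keyword-based group assignment and per-group error rate
# ER(S) = |errors in S| / |validation examples|.

TRIGGERS = {
    "superlatives": {"highest", "lowest", "most", "least"},
    "comparatives": {"more", "less", "than"},
    "negations": {"not", "never", "no"},
    "aggregations": {"total", "average", "sum", "times"},
}

def assign_group(statement):
    tokens = set(statement.lower().split())
    hits = [g for g, words in TRIGGERS.items() if tokens & words]
    if len(hits) > 1:
        return "multiple"  # fifth, exclusive case
    return hits[0] if hits else "other"

def error_rates(examples):
    """examples: list of (statement, model_was_correct) pairs.
    Returns the share of all examples that are errors in each group."""
    n = len(examples)
    rates = {}
    for stmt, correct in examples:
        if not correct:
            g = assign_group(stmt)
            rates[g] = rates.get(g, 0) + 1 / n
    return rates
```

Note the denominator is the whole validation set, so the rates across groups sum to the overall error rate.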
In Figure 4 we show examples in every group where our model is correct on the majority of the cases (across 9 trials), and the MASK-LM baseline is not. We also show examples that continue to produce errors after our pre-training. Many examples in this last group require multi-hop reasoning or complex numerical operations.
Model Agreement Similar to other complex binary classification datasets such as BOOLQ (Clark et al., 2019), for TABFACT one may question whether models are guessing the right answer. To gauge the magnitude of this issue we look at 9 independent runs of each variant and analyze how many of them agree on the correct answer. Figure 5 shows that while for MASK-LM all models agree on the right answer for only 24.2% of the examples, this goes up to 55.5% when using the counterfactual and synthetic pre-training. This suggests that the amount of guessing decreases substantially.
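The agreement measure itself is simple: the fraction of examples on which every independent run predicts the gold label. A sketch (function name is ours):

```python
# Fraction of examples on which all runs agree on the gold label.

def full_agreement_ratio(predictions, gold):
    """predictions: list of runs, each a list of predicted labels;
    gold: list of gold labels of the same length."""
    n = len(gold)
    agree = sum(
        all(run[i] == gold[i] for run in predictions) for i in range(n)
    )
    return agree / n
```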

Related Work
Logic-free Semantic Parsing Recently, methods that skip creating logical forms and generate answers directly have been used successfully for semantic parsing (Mueller et al., 2019). In this group, TAPAS (Herzig et al., 2020) uses special learned embeddings to encode row/column index and numerical order and pre-trains a MASK-LM model on a large corpus of text and tables co-occurring on Wikipedia articles. Importantly, next sentence prediction, which in this context amounts to detecting whether the table and the sentence appear in the same article, was not found to be effective. Our hypothesis is that the task was not hard enough to provide a training signal. We build on top of the TAPAS model and propose harder and more effective pre-training tasks to achieve strong performance on the TABFACT dataset.

Figure 4: Example statements from each salient group that our models consistently answer correctly. Aggregations: 'Choi Moon-Sik played in Seoul three times in total.', 'The total number of bronze medals were half of the total number of medals.' Superlatives: 'Mapiu school has the highest roll in the state authority.', 'Carlos Moya won the most tournaments with two wins.' Comparatives: 'Bernard Holsey has 3 more yards than Angel Rubio.', 'In 1982, the Kansas City Chiefs played more away games than home games.' Negations: 'The Warriors were not the home team at the game on 11-24-2006.', 'Dean Semmens is not one of the four players born after 1981.'

Figure 5: Number of runs (out of 9) that give the correct answer. Better pre-training leads to more consistency across models. The ratio of samples answered correctly by all models is 24.2% for MASK-LM but 55.5% for Synthetic + Counterfactual.
Entailment tasks Recognizing entailment has a long history in NLP (Dagan et al., 2010). Recently, the text-to-text framework has been expanded to incorporate structured data, like knowledge graphs (Vlachos and Riedel, 2015), tables (Jo et al., 2019) or images (Suhr et al., 2017, 2019). The large-scale TABFACT dataset is one such example. Among the top performing models in the task is a BERT-based model acting on a flattened version of the table, using textual templates to make the tables resemble natural text. Our approach has two key improvements: the usage of special embeddings, as introduced in Herzig et al. (2020), and our novel counterfactual and synthetic pre-training (Section 3).

Intermediate Pre-training Language model fine-tuning (Howard and Ruder, 2018), also known as domain adaptive pre-training (Gururangan et al., 2020), has been studied as a way to handle covariate shift. Our work is closer to intermediate task fine-tuning (Pruksachatkun et al., 2020), where one tries to teach the model higher-level abilities. Similarly, we try to improve the discrete and numeric reasoning capabilities of the model.

Counterfactual data generation
The most similar approach to ours appears in Xiong et al. (2020), replacing entities in Wikipedia by others with the same type for a MASK-LM objective. We, in contrast, take advantage of other rows in the table to produce plausible negatives, and also replace dates and numbers. Kaushik et al. (2020), meanwhile, rely on human annotators to create counterfactual examples, whereas our method is fully automatic.

Numeric reasoning Numeric reasoning in natural language processing has been recognized as an important part of entailment models (Sammons et al., 2010) and reading comprehension (Ran et al., 2019). Wallace et al. (2019) studied the capacity of different models to understand numerical operations and showed that BERT-based models still have headroom. This motivates the use of the synthetic generation approach to improve numerical reasoning in our model.
Synthetic data generation Synthetic data has been used to improve learning in NLP tasks (Alberti et al., 2019;Lewis et al., 2019;Wu et al., 2016;Leonandya et al., 2019). In semantic parsing for example (Wang et al., 2015;Iyer et al., 2017;Weir et al., 2020), templates are used to bootstrap models that map text to logical forms or SQL. Salvatore et al. (2019) use synthetic data generated from logical forms to evaluate the performance of textual entailment models (e.g., BERT). Geiger et al. (2019) use synthetic data to create fair evaluation sets for natural language inference. Geva et al. (2020) show the importance of injecting numerical reasoning via generated data into the model to solve reading comprehension tasks. They propose different templates for generating synthetic numerical examples. In our work we use a method that is better suited for tables and to the entailment task, and is arguably simpler.

Conclusion
We introduced two pre-training tasks, counterfactual and synthetic, to obtain state-of-the-art results on the TABFACT (Chen et al., 2020) entailment task on tabular data. We adapted the BERT-based architecture of TAPAS (Herzig et al., 2020) to binary classification and showed that pre-training on both tasks yields substantial improvements not only on TABFACT but also on a QA dataset, SQA (Iyyer et al., 2017), even with only a subset of the training data.
We ran a study on column selection methods to speed-up training and inference. We found that we can speed up the model by a factor of 2 at a moderate drop in accuracy (≈ 1 point) and by a factor of 4 at a larger drop but still with higher accuracy than previous approaches.
We characterized the complex operations required for table entailment to guide future research in this topic. Our code and models will be open-sourced. We provide details on our experimental setup and hyper-parameter tuning in Appendix A. Appendices B and C give additional information on the model and the TABFACT dataset. We give details and results regarding our column pruning approach in Appendix D.
Full results for SQA are displayed in Appendix E. Appendix F shows the accuracy on the pre-training tasks' held-out sets. Appendix G contains the trigger words used for identifying the salient groups in the analysis section.

A.1 Hyper-Parameter Search
The hyper-parameters are optimized using a black box Bayesian optimizer similar to Google Vizier (Golovin et al., 2017), which looked at validation accuracy after 8,000 steps only, in order to prevent over-fitting and use resources effectively. The ranges used were a learning rate from 10^-6 to 3 × 10^-4, dropout probabilities from 0 to 0.2 and a warm-up ratio from 0 to 0.05. We used 200 runs and kept the median values for the top 20 trials. In order to show the impact of the number of trials on the expected validation results, we follow Henderson et al. (2018) and Dodge et al. (2019). Given that we used Bayesian optimization instead of random search, we applied the bootstrap method to estimate the mean and variance of the max validation accuracy at 8,000 steps for different numbers of trials. From trial 10 to 200 we noted an increase of 0.4% in accuracy and a standard deviation that decreases from 2% to 1.3%.
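The bootstrap estimate of the expected maximum validation accuracy as a function of the number of trials can be sketched as follows (the function name, resample count and seed handling are our choices, not the paper's):

```python
import random

# Bootstrap the expected maximum validation accuracy over n_trials
# resampled runs (in the spirit of Dodge et al., 2019).

def expected_max(accuracies, n_trials, n_boot=10000, rng=None):
    """accuracies: observed validation accuracies across trials."""
    rng = rng or random.Random(0)
    maxima = [
        max(rng.choice(accuracies) for _ in range(n_trials))
        for _ in range(n_boot)
    ]
    return sum(maxima) / n_boot
```

Repeating this for increasing `n_trials` yields the curve of expected best accuracy versus search budget; its variance across bootstrap resamples gives the reported standard deviation.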

A.2 Hyper-Parameters
We use the same hyper-parameters for pre-training and fine-tuning. The input length is 256 for pre-training and 512 for fine-tuning if not stated otherwise. We use 80,000 training steps, a learning rate of 2 × 10^-5 and a warm-up ratio of 0.05. We disable the attention dropout in BERT but use a hidden dropout probability of 0.07. Finally, we use an Adam optimizer with weight decay with the same configuration as BERT.
For SQA we do not use any search algorithm and use the same model and the same hyper-parameters as the ones used in Herzig et al. (2020). The only difference is that we start the fine-tuning from a checkpoint trained on our intermediate pre-training entailment task.

Figure 6: Input representation for the model.

A.3 Number of Parameters
The number of parameters is the same as for BERT: 110M for base models and 340M for Large models.

A.4 Training Time
We train all our models on Cloud TPUs V3.

B Model
For illustrative purposes, we include the input representation using the 6 types of embeddings, as depicted by Herzig et al. (2020).

C Dataset
Statistics of the TABFACT dataset can be found in

D Columns selection algorithm
Let cost(·) ∈ ℕ be the function that computes the number of tokens of a text using the BERT tokenizer, t_s the tokenized statement text and t_{c_i} the text of column i. We denote the columns as (c_1, ..., c_n), ordered by their scores, where n is the number of columns. Let m be the maximum number of tokens. Then the cost of a candidate column must verify the following condition:

cost(t_s) + Σ_{c_j ∈ C_i^+} cost(t_{c_j}) + cost(t_{c_i}) + 2 ≤ m

where C_i^+ is the set of retained columns at iteration i. The 2 is added to the condition as two special tokens are added to the input: [CLS], t_s, [SEP], t_{c_1}, ..., t_{c_n}. If a current column c_i does not respect the condition then the column is not selected. Whether or not the column is retained, the algorithm continues and verifies whether the next column can fit. It follows that C_n^+ contains the maximum number of columns that can fit under m while respecting the column scoring order.
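The greedy selection under this condition can be sketched as follows, assuming the token costs have been precomputed (function name and argument shapes are ours):

```python
# Greedy column selection: visit columns in decreasing relevance
# order; keep a column only if the statement, the already-kept
# columns, this column, and the 2 special tokens still fit in m.

def select_columns(cost_s, column_costs, max_tokens):
    """cost_s: token cost of the statement.
    column_costs: list of (column_id, cost), sorted by decreasing
    relevance score. Returns the ids of the retained columns."""
    kept, used = [], cost_s + 2  # account for [CLS] and [SEP]
    for col_id, cost in column_costs:
        if used + cost <= max_tokens:
            kept.append(col_id)
            used += cost
        # otherwise skip, but keep trying later (cheaper) columns
    return kept
```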
There are a number of heuristic pruning approaches we have experimented with; results are given in Table 7.

Word-to-vec (W2V) Scores columns by word-embedding similarity to the statement. Let v_i denote the embedding of token i. For a given column token c, we define its relevance with respect to the statement as the average similarity to every statement token, where τ is a threshold that helps to remove noise from unrelated word embeddings. We set τ to 0.89. We experimented with max and sum as alternative aggregation functions but found the average to perform best. The final score between the statement S and the column C averages this relevance over the column's tokens.

Term frequency–inverse document frequency (IWF) Scores the columns' tokens proportionally to the word frequency in the statement, offset by the word frequency computed over all the tables and statements from the training set, where TF(t_s, c) is how often the token c occurs in the statement t_s, and WF(c) is the frequency of c in a word count list. The final score of a column C aggregates these token scores.

Character n-gram (CHAR) Scores columns by character overlap with the statement. This method looks for sub-lists of a word's characters in the statement. The length of a character list is clamped between a minimum and a maximum; in the experiments we use 5 and 20 as minimum and maximum lengths. Let L_{s,c} be the set of all overlapping character lengths. The score for each column is given by

f(t_s, t_c) = min(max(L_{s,c}, 5), 20) / cost(t_c)

Table 7: Heuristic exact match (HEM), word-to-vec (W2V), inverse word frequency (IWF), and character n-gram (CHAR) at different pre-training (PT) and fine-tuning (FT) sizes. Error margins are estimated as half the interquartile range.

Table 8 shows the accuracy on the first development fold and the test set. As for the main results, the error margins displayed are half the interquartile range over 9 runs, i.e., half the difference between the first and third quartiles. This range contains half of the runs and provides a measure of robustness.
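The CHAR heuristic can be illustrated concretely. This is a sketch under stated assumptions: we approximate cost(·) with character length rather than BERT word pieces, and we take max(L_{s,c}) as the longest overlapping substring, clamped to [5, 20] as in the formula above.

```python
def char_ngram_score(statement, column_text, min_len=5, max_len=20):
    """CHAR heuristic sketch: longest character substring of the column
    that also appears in the statement, clamped to [min_len, max_len]
    and normalised by the column's cost (here: character length)."""
    best = 0
    n = len(column_text)
    for i in range(n):
        # Only consider substrings within the allowed length window.
        for j in range(i + min_len, min(i + max_len, n) + 1):
            if column_text[i:j] in statement:
                best = max(best, j - i)
    clamped = min(max(best, min_len), max_len)
    return clamped / max(len(column_text), 1)
```

For example, scoring the column "total points scored" against the statement "the highest total points" finds the 12-character overlap "total points" and normalises it by the column's length.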

F Pre-Training Data
When training on the pre-training data, we hold out approximately 1% of the data to test how well the model solves the pre-training task. In Table 9, we compare the test pre-training accuracy on synthetic and counterfactual data to that of models trained on the statements only, to understand whether there is considerable bias in the data. Both datasets have some bias (i.e., the accuracy without the table is higher than 50%). Still, there is a sufficient gap between training with and without tables, so the data remains useful. The synthetic data can be solved almost perfectly, whereas on the counterfactual data we only reach an accuracy of 84.3%. This is expected, as there is no guarantee that the model has enough information to decide whether a statement is true or false for the counterfactual examples.

Table 9: Accuracy on synthetic (Val_S) and counterfactual (Val_C) held-out sets of the pre-training data.
In Table 10 we show the ablation results when removing the counterfactual statements that lack a supporting entity, that is, a second entity that appears in both the table and the sentence. Statements without a supporting entity are more likely to yield incorrect negative pairs, but removing them discards 7 out of 8 examples, which ends up hurting the results.

G Salient Groups Definition
In Table 11 we show the words that are used as markers to define each group. We first manually identified the operations most often needed to solve the task and found relevant words linked to each group. The heuristic was validated by manually inspecting 50 samples from each group and observing higher than 90% accuracy.

Table 8: SQA dev (first fold) and test results. ALL is the average question accuracy, SEQ the sequence accuracy, and QX the accuracy of the X-th question in a sequence. We show the median over 9 trials; errors are estimated with half the interquartile range.

Slice        | Words
Aggregations | total, count, average, sum, amount, there, only
Superlatives | first, highest, best, newest, most, greatest, latest, biggest, and their opposites
Comparatives | than, less, more, better, worse, higher, lower, shorter, same
Negations    | not, any, none, no, never
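The marker-word heuristic above can be sketched as a simple lookup. This is an illustrative assumption of how slice assignment might work, using the word lists from Table 11 (the "opposites" of the superlatives are omitted here for brevity); it is not the authors' exact implementation.

```python
# Marker words per slice, abbreviated from Table 11.
SLICES = {
    "aggregations": {"total", "count", "average", "sum", "amount", "there", "only"},
    "superlatives": {"first", "highest", "best", "newest", "most",
                     "greatest", "latest", "biggest"},
    "comparatives": {"than", "less", "more", "better", "worse",
                     "higher", "lower", "shorter", "same"},
    "negations": {"not", "any", "none", "no", "never"},
}

def salient_groups(statement):
    """Return all slices whose marker words occur in the statement."""
    tokens = set(statement.lower().split())
    return sorted(name for name, markers in SLICES.items() if tokens & markers)
```

A statement can belong to several slices at once, e.g. one containing both "total" and "higher" falls into the aggregation and comparative groups.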