TABBIE: Pretrained Representations of Tabular Data

Existing work on tabular representation-learning jointly models tables and associated text using self-supervised objective functions derived from pretrained language models such as BERT. While this joint pretraining improves tasks involving paired tables and text (e.g., answering questions about tables), we show that it underperforms on tasks that operate over tables without any associated text (e.g., populating missing cells). We devise a simple pretraining objective (corrupt cell detection) that learns exclusively from tabular data and reaches the state-of-the-art on a suite of table-based prediction tasks. Unlike competing approaches, our model (TABBIE) provides embeddings of all table substructures (cells, rows, and columns), and it also requires far less compute to train. A qualitative analysis of our model’s learned cell, column, and row representations shows that it understands complex table semantics and numerical trends.


Introduction
Large-scale self-supervised pretraining has substantially advanced the state-of-the-art in natural language processing (Peters et al., 2018; Devlin et al., 2018; Liu et al., 2019). More recently, these pretraining methods have been extended to jointly learn representations of tables as well as text (Herzig et al., 2020; Yin et al., 2020), which enables improved modeling of tasks such as question answering over tables. However, many practical problems involve semantic understanding of tabular data without additional text-based input, such as extracting tables from documents, retrieving similar columns or cells, and filling in missing information. In this work, we design a pretraining methodology specifically for tables (Tabular Information Embedding, or TABBIE) that resembles several core tasks in table extraction and decomposition pipelines and allows easy access to representations of different tabular substructures (cells, rows, and columns).
Existing table representation models such as TaBERT (Yin et al., 2020) and TaPas (Herzig et al., 2020) concatenate tabular data with an associated piece of text and then use BERT's masked language modeling objective for pretraining. These approaches are computationally expensive due to the long sequences that arise from concatenating text with linearized tables, which necessitates truncating the input sequences 1 to make training feasible. We show that TaBERT underperforms on downstream table-based applications that operate independently of external text (e.g., deciding whether cell text was corrupted while extracting a table from a PDF), which motivates us to investigate an approach that preserves the full table during pretraining.
Our TABBIE architecture relies on two Transformers that independently encode rows and columns, respectively; their representations are pooled at each layer. This setup reduces the sequence length of each Transformer's input, which cuts down on its complexity, while also allowing us to easily extract representations of cells, rows, and columns. Additionally, TABBIE uses a simplified training objective compared to masked language modeling: instead of predicting masked cells, we repurpose ELECTRA's objective function (Clark et al., 2020) for tabular pretraining by asking the model to predict whether or not each cell in a table is real or corrupted. We emphasize that this pretraining objective is a fundamental task in table structure decomposition pipelines (Nishida et al., 2017;Tensmeyer et al., 2019;Raja et al., 2020), in which incorrectly predicting row/column separators or cell boundaries leads to corrupted cell text. Unlike Clark et al. (2020), we do not require a separate "generator" model that produces corrupted candidates, as we observe that simple corruption processes (e.g., sampling cells from other tables, swapping cells within the same column) yield powerful representations after pretraining.
In a controlled comparison to TaBERT (pretraining on the same number of tables and using a similarly-sized model), we evaluate TABBIE on three table-based benchmarks: column population, row population, and column type prediction. On most configurations of these tasks, TABBIE achieves state-of-the-art performance, outperforming TaBERT and other baselines, while on the others it performs competitively with TaBERT. Additionally, TABBIE was trained on 8 V100 GPUs in just over a week, compared to the 128 V100 GPUs used to train TaBERT in six days. A qualitative nearest-neighbor analysis of embeddings derived from TABBIE confirms that it encodes complex semantic properties of textual and numeric cells and substructures. We release our pretrained models and code to support further advances on table-based tasks. 2

Model

TABBIE is a self-supervised pretraining approach trained exclusively on tables, unlike prior approaches (Herzig et al., 2020; Yin et al., 2020) that jointly model tables and associated text snippets. At a high level, TABBIE encodes each cell of a table using two different Transformer models (Vaswani et al., 2017), one operating across the rows of the table and the other across its columns. At each layer, the representations from the row and column Transformers are averaged and then passed as input to the next layer, which produces a contextualized representation of each cell within the table. Inspired by the ELECTRA objective of Clark et al. (2020), we place a binary classifier over TABBIE's final-layer cell representations to predict whether each cell is real or has been corrupted (i.e., replaced by an intruder cell during preprocessing). In the remainder of this section, we formalize TABBIE's model architecture and pretraining objective.

Model Architecture
TABBIE takes an M × N table as input and produces embeddings x_ij for each cell (where i and j are row and column indices, respectively), as well as embeddings c_j for individual columns and r_i for individual rows.

Initialization:
We begin by initializing the cell embeddings x_ij using a pretrained BERT model (Devlin et al., 2018). 3 Specifically, for each cell (i, j), we feed its contents into BERT and extract the 768-d [CLS] token representation. This step allows us to leverage BERT's powerful semantic text encoder to compute representations of cells out of context, which is important because many tables contain cells with long-form text (e.g., Notes columns). Additionally, BERT has been shown to encode some degree of numeracy (Wallace et al., 2019), which helps represent cells with numerical content. We keep this BERT encoder fixed during training to reduce computational expense. Finally, we add learned positional embeddings to each of the [CLS] vectors to form the initialization of x_ij. More specifically, we have two sets of positional embeddings, p^(r)_i ∈ R^H and p^(c)_j ∈ R^H, which model the position of rows and columns, respectively, and are randomly initialized and fine-tuned via TABBIE's self-supervised objective.
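The initialization step above can be sketched as follows. Here `encode_cell_stub` is a hypothetical stand-in for the frozen BERT [CLS] encoder (a real implementation would feed each cell's text through a pretrained BERT model), and the positional embeddings are randomly initialized rather than learned:

```python
import zlib
import numpy as np

H = 768  # hidden size, matching BERT-Base's [CLS] vector

def encode_cell_stub(text: str) -> np.ndarray:
    # Stand-in for the frozen BERT encoder: a real implementation would
    # return the 768-d [CLS] vector for this cell's text. Here we hash
    # the text to a deterministic pseudo-random vector for illustration.
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    return rng.standard_normal(H)

def initialize_cells(table, p_row, p_col):
    # x_ij = CLS(cell text) + p_row[i] + p_col[j]
    M, N = len(table), len(table[0])
    x = np.zeros((M, N, H))
    for i in range(M):
        for j in range(N):
            x[i, j] = encode_cell_stub(table[i][j][:300]) + p_row[i] + p_col[j]
    return x

table = [["country", "capital"], ["France", "Paris"]]
rng = np.random.default_rng(0)
p_row = 0.02 * rng.standard_normal((2, H))  # learned during pretraining in practice
p_col = 0.02 * rng.standard_normal((2, H))
x0 = initialize_cells(table, p_row, p_col)  # shape (2, 2, 768)
```

The 300-character truncation matches the preprocessing described later in the pretraining details.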
Contextualizing the cell embeddings: The cell embeddings we get from BERT are uncontextualized: they are computed in isolation from all other cells in the table. While methods such as TaBERT and TaPaS contextualize cell embeddings by linearizing the table into a single long sequence, we take a different and more computationally manageable approach. We define a row Transformer, which encodes cells across each row of the table, as well as a column Transformer, which does the same across columns.

[Figure: TABBIE overview. Step 1: compute column and row embeddings using two separate Transformers, with [CLSCOL] and [CLSROW] tokens prepended to each column and row. Step 2: compute contextualized cell embeddings by averaging the row/column embeddings. Step 3: feed these contextualized cell embeddings as input to the next layer (repeated for 12 layers).]

Specifically, assume row i contains cell embeddings x_{i,1}, x_{i,2}, ..., x_{i,N}; we feed this sequence of embeddings into a row Transformer block, which uses self-attention to produce contextualized output representations r_{i,1}, r_{i,2}, ..., r_{i,N}. Similarly, assume column j contains cell embeddings x_{1,j}, x_{2,j}, ..., x_{M,j}; the column Transformer produces contextualized representations c_{1,j}, c_{2,j}, ..., c_{M,j}. After running the two Transformers over all rows and columns, respectively, each cell (i, j) of a table is associated with a row embedding r_{i,j} as well as a column embedding c_{i,j}.
The final step of cell contextualization is to compose the row and column embeddings together before feeding the result to the next layer. Intuitively, if we do not aggregate the two sets of embeddings, subsequent layers of the model only have access to information from a single row or column, which prevents contextualization across the whole table. We implement this aggregation through simple averaging: at layer L of TABBIE, we compute cell embeddings as

x^(L+1)_{i,j} = (r^(L)_{i,j} + c^(L)_{i,j}) / 2.

The new cell representations x^(L+1)_{i,j} are then fed to the row and column Transformers at the next layer L + 1.
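The averaging step takes only a few lines; the row and column Transformers themselves are assumed to exist elsewhere, so this sketch shows just the aggregation between layers:

```python
import numpy as np

def aggregate_layer(row_ctx: np.ndarray, col_ctx: np.ndarray) -> np.ndarray:
    # row_ctx[i, j] holds r_{i,j} and col_ctx[i, j] holds c_{i,j},
    # both of shape (M, N, H). The next layer's input is their
    # elementwise mean.
    assert row_ctx.shape == col_ctx.shape
    return (row_ctx + col_ctx) / 2.0

# Toy example with M=2 rows, N=2 columns, H=4 hidden dims:
r = np.ones((2, 2, 4))
c = np.zeros((2, 2, 4))
x_next = aggregate_layer(r, c)  # every entry is 0.5
```

Because the two Transformers only attend within a single row or column, this averaging is the sole mechanism by which information propagates across the full table over successive layers.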
Extracting representations of an entire row or column: The row and column Transformers defined above produce separate representations for every cell in a particular row or column. However, many table-related downstream tasks (e.g., retrieve similar columns from a huge dataset of tables to some query column) can benefit from embeddings that capture the contents of an entire row or column. To enable this functionality in TABBIE, we simply prepend [CLSROW] and [CLSCOL] tokens to the beginning of each row and column in an input table as a preprocessing step. After pretraining, we can extract the final-layer cell representations of these [CLS] tokens to use in downstream tasks.
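As a sketch, this preprocessing step might look like the following; how the top-left corner cell is handled is an illustrative choice here, not something specified above:

```python
def add_cls_tokens(table):
    # Prepend a [CLSCOL] cell to the top of each column and a [CLSROW]
    # cell to the start of each row. The corner cell is left as padding
    # (an illustrative choice).
    n_cols = len(table[0])
    padded = [["[PAD]"] + ["[CLSCOL]"] * n_cols]
    padded += [["[CLSROW]"] + list(row) for row in table]
    return padded

t = add_cls_tokens([["a", "b"], ["c", "d"]])
# t == [["[PAD]", "[CLSCOL]", "[CLSCOL]"],
#       ["[CLSROW]", "a", "b"],
#       ["[CLSROW]", "c", "d"]]
```

After pretraining, the final-layer embeddings at these [CLSCOL]/[CLSROW] positions serve as whole-column and whole-row representations.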

Pretraining
Having described TABBIE's model architecture, we turn now to its training objective. We adapt the selfsupervised ELECTRA objective proposed by Clark et al. (2020) for text representation learning, which places a binary classifier over each word in a piece of text and asks if the word either is part of the original text or has been corrupted. While this objective was originally motivated as enabling more efficient training compared to BERT's masked language modeling objective, it is especially suited for tabular data, as corrupt cell detection is actually a fundamental task in table structure decomposition pipelines such as (Nishida et al., 2017;Tensmeyer et al., 2019;Raja et al., 2020), in which incorrectly predicted row/column separators or cell boundaries can lead to corrupted cell text.
In our extension of ELECTRA to tables, a binary classifier takes a final-layer cell embedding as input and decides whether the cell has been corrupted. More concretely, for cell (i, j), we compute the corruption probability as

P(corrupted) = σ(w · x^(L)_{i,j}),

where L indexes TABBIE's final layer, σ is the sigmoid function, and w is a weight vector of the same dimensionality as the cell embedding. Our final loss function is the binary cross-entropy loss of this classifier averaged across all cells in the table.
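A minimal sketch of the classifier and loss, using numpy in place of a deep-learning framework:

```python
import numpy as np

def corruption_loss(x_final, w, labels):
    # x_final: (M, N, H) final-layer cell embeddings; w: (H,) weight vector;
    # labels: (M, N) with 1 for corrupted cells, 0 otherwise.
    logits = x_final @ w                   # (M, N) scores w . x_ij
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    eps = 1e-12                            # numerical safety for log
    bce = -(labels * np.log(probs + eps)
            + (1 - labels) * np.log(1 - probs + eps))
    return probs, bce.mean()               # loss averaged over all cells

x = np.zeros((2, 3, 4))   # all-zero embeddings => every probability is 0.5
w = np.ones(4)
labels = np.zeros((2, 3))
probs, loss = corruption_loss(x, w, labels)
```

With uninformative (all-zero) embeddings, the classifier outputs 0.5 everywhere and the loss equals log 2, the expected starting point before training.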

Cell corruption process
Our formulation diverges from Clark et al. (2020) in how corrupted cells are generated. In ELECTRA, a separate generator model is trained with BERT's masked language modeling objective to produce candidate corrupted tokens: for instance, given "Jane went to the [MASK] to check on her experiments," the generator might produce corrupted candidates such as "lab" or "office." Simpler corruption strategies, such as randomly sampling words from the vocabulary, cannot induce powerful representations of text because local syntactic and semantic patterns are usually sufficient to detect obvious corruptions. For tabular data, however, we show that simple corruption strategies (Figure 3) that take advantage of intra-table structure do yield powerful representations without the need for a separate generator network. More specifically, we use two different corruption strategies:
• Frequency-based cell sampling: Our first strategy simply samples corrupt candidates from the training cell frequency distribution (i.e., more commonly-occurring cells are sampled more often than rare cells). One drawback of this method is that it often produces samples that violate a particular column type (for instance, a textual cell sampled as a replacement for a cell in a numeric column). Despite this limitation, our analysis in Section 4 shows that this strategy alone results in strong performance on most downstream table-based tasks, although it does not yield a rich understanding of intra-table semantics.
• Intra-table cell swapping: To encourage the model to learn fine-grained distinctions between topically-similar data, our second strategy produces corrupted candidates by swapping two cells in the same table (Figure 3c, d). This task is more challenging than the frequency-based sampling strategy above, especially when the swapped cells occur within the same column. While it underperforms frequency-based sampling on downstream tasks, it qualitatively results in more semantic similarity among nearest neighbors of column and row embeddings.
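The two strategies above can be sketched as follows. The corpus-level cell frequencies, the corruption rate, and the single-swap-per-table simplification are illustrative assumptions, not values specified in the text:

```python
import random

def corrupt_freq(table, vocab, weights, p_corrupt, rng):
    # Frequency-based sampling: with probability p_corrupt, replace a cell
    # by one drawn from the corpus cell-frequency distribution.
    out = [row[:] for row in table]
    labels = [[0] * len(row) for row in table]
    for i in range(len(table)):
        for j in range(len(table[0])):
            if rng.random() < p_corrupt:
                out[i][j] = rng.choices(vocab, weights=weights, k=1)[0]
                labels[i][j] = 1
    return out, labels

def corrupt_swap(table, rng):
    # Intra-table swapping: exchange the contents of two randomly chosen
    # cells and mark both as corrupted (simplified to one swap per table).
    out = [row[:] for row in table]
    M, N = len(table), len(table[0])
    (i1, j1), (i2, j2) = rng.sample(
        [(i, j) for i in range(M) for j in range(N)], 2)
    out[i1][j1], out[i2][j2] = out[i2][j2], out[i1][j1]
    labels = [[0] * N for _ in range(M)]
    labels[i1][j1] = labels[i2][j2] = 1
    return out, labels

rng = random.Random(0)
table = [["France", "Paris"], ["Japan", "Tokyo"]]
corrupted, labels = corrupt_freq(table, ["USA", "Berlin"], [3, 1], 1.0, rng)
```

Constraining swaps to the same row or column (as in the MIX setting described later) would add one filtering step to the candidate list in `corrupt_swap`.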

Pretraining details
Data: We aim for as controlled a comparison with TaBERT (Yin et al., 2020) as possible, pretraining on the same number of tables.

Experimental settings: We train TABBIE for seven epochs (just over a week) on 8 V100 GPUs using mixed precision. TABBIE has 12 layers and a hidden dimensionality of 768 for both row and column Transformers, in an effort to be comparably sized to the TaBERT-Base model. 5 Before computing the initial cell embeddings with BERT, we truncate each cell's contents to the first 300 characters, as some cells contain huge amounts of text. We also truncate tables to 30 rows and 20 columns to avoid memory issues (note that this is much larger than the three rows used by TaBERT). We compare two pretrained models trained with different cell corruption strategies on the downstream tasks. The first (FREQ) uses exclusively frequency-based cell sampling. The second (MIX) is a 50/50 mixture of frequency-based sampling and intra-table cell swapping, where we additionally specify that half of the intra-table swaps must come from the same row or column to make the objective more challenging.

Experiments
We validate TABBIE's table representation quality through its performance on three downstream table-centric benchmarks (column population, row population, and column type prediction) that measure semantic table understanding. In most configurations of these tasks, TABBIE outperforms TaBERT and other baselines to set new state-of-the-art numbers. Note that we do not investigate TABBIE's performance on table-and-text tasks such as WikiTableQuestions (Pasupat and Liang, 2015), as our focus is not on integrating TABBIE into complex task-specific pipelines (Liang et al., 2018), although this is an interesting avenue for future work.

Fine-tuning TABBIE
In all of our downstream experiments, we apply essentially the same fine-tuning strategy to both TABBIE and TaBERT: we select the subset of final-layer representations (i.e., cell or column representations) that corresponds to the tabular substructures used in the downstream task, and we place a classifier over these representations to predict the training labels. We select task-specific hyperparameters based on the size of each dataset (full details in Table 1) and report the test performance of the best-performing validation checkpoint. For both models, we backpropagate the downstream error signal into all of the model's parameters (i.e., we do not "freeze" the pretrained model).

Column Population
In the column population task, which is useful for attribute discovery, tabular data augmentation, and table retrieval (Das Sarma et al., 2012), a model is given the first N columns of a "seed" table and asked to predict the remaining column headers. Zhang and Balog (2017) compile a dataset for this task comprising 1.6M tables from Wikipedia with a test set of 1,000 tables, formulated as a multi-label classification task with 127,656 possible header labels. Importantly, we remove all of the tables in the column population test set from our pretraining data to avoid inflating our results in case TABBIE memorizes the missing columns during pretraining. 6 To fine-tune TABBIE on this task, we concatenate the [CLSCOL] embeddings of the seed columns into a single vector and pass it through a single linear and softmax layer, training with a multi-label classification objective (Mahajan et al., 2018). Our baselines include the generative probabilistic model (GPM) of Zhang and Balog (2017) as well as a word embedding-based extension called Table2VecH (TH) devised by Deng et al. (2019). As fine-tuning on the full dataset is extremely expensive for TABBIE and TaBERT, we fine-tune them on a random subset of 100K training examples, which puts both models at a further disadvantage relative to the baselines.

Table 2: TABBIE outperforms all methods on the column population task, with the biggest improvement coming with just a single seed column (N = 1). Despite its simplicity, the FREQ corruption strategy yields better results than MIX.

As shown in Table 2, both pretrained models outperform the other baselines, and TABBIE consistently outperforms TaBERT regardless of how many seed columns are provided, especially with only one seed column. This result indicates that TABBIE encodes more semantics about headers and columns than TaBERT.
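The fine-tuning head described above can be sketched as follows; the label count is reduced from 127,656 to 5 for illustration, and the weights are random stand-ins for trained parameters:

```python
import numpy as np

def column_population_scores(clscol_embs, W, b):
    # clscol_embs: (N_seed, H) [CLSCOL] embeddings of the seed columns,
    # concatenated into one vector and passed through a single linear +
    # softmax layer over all candidate header labels.
    v = clscol_embs.reshape(-1)        # (N_seed * H,)
    logits = W @ v + b                 # (num_labels,)
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
seed_cols = rng.standard_normal((2, 8))  # N_seed=2 seed columns, toy H=8
W = rng.standard_normal((5, 16))         # 5 candidate headers (toy)
b = np.zeros(5)
scores = column_population_scores(seed_cols, W, b)
```

In the multi-label setting, the target distribution places mass on every gold header, so several labels can receive high scores simultaneously.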

Row Population
The row population task is more challenging than column population: given the first N rows of a table in which the first column contains entities (e.g., "Country"), models must predict the remaining entries of the first column. Making reasonable predictions of which entities best fill the column requires understanding the full context of the seed rows.

Table 3: TABBIE outperforms baselines on row population when provided with more seed rows N, although TaBERT is superior given just a single seed row. Again, the FREQ strategy produces better results than MIX.

When given only one seed row, TaBERT slightly outperforms TABBIE, but with more seed rows, TABBIE exhibits small improvements over TaBERT.

Column Type Prediction
While the prior two tasks involve predicting missing elements of a table, column type prediction involves predicting a high-level type for a particular column (e.g., name, age) without access to its header. This task is useful when indexing tables with missing column names, which happens relatively often in practice, or for schema matching, and, like the other tasks, it requires understanding the surrounding table context. We evaluate our models on the same subset of the VizNet Web Tables corpus created by Zhang et al. (2020) to evaluate their column type predictor, SATO. They formulate this task as a multi-class classification problem (with 78 classes), with a training set of 64,000 tables and a test set of 16,000 tables. We set aside 6,400 training tables to form a validation set for both TABBIE and TaBERT, and we fine-tune each model on small random subsets of the training data (1,000 and 10,000 labeled tables), in addition to the full training set, to evaluate performance in a simulated low-resource setting.
Along with TaBERT, we compare against two recently-proposed column type prediction methods: Sherlock (Hulsebos et al., 2019), which uses a multi-input neural network with hand-crafted features extracted from each column, and the aforementioned SATO (Zhang et al., 2020), which improves on Sherlock by incorporating table context, topic model outputs, and label co-occurrence information.

Table 4: Support-weighted F1 score of different models on column type prediction. TaBERT and TABBIE perform similarly in low-resource settings (n=1000) and when the full training data is used (n=all).

Table 4 shows the support-weighted F1 score for each method. As in the previous two tasks, TABBIE and TaBERT significantly outperform the prior state-of-the-art (SATO). Here there are no clear differences between the two models, but both reach higher F1 scores than the other baselines even when given only 1,000 training examples, which demonstrates the power of table-based pretraining.
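For reference, support-weighted F1 averages the per-class F1 scores weighted by each class's frequency in the gold labels. A self-contained sketch:

```python
def support_weighted_f1(y_true, y_pred):
    # Per-class F1, averaged with weights proportional to class support.
    total = len(y_true)
    score = 0.0
    for c in set(y_true):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (y_true.count(c) / total) * f1  # weight by class support
    return score
```

This matches the "weighted" averaging mode of standard metric libraries and is appropriate here because the 78 column types are imbalanced.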

Analysis
The results in the previous section show that TABBIE is a powerful table representation method, outperforming TaBERT in many downstream task configurations and remaining competitive in the rest. In this section, we dig deeper into TABBIE's representations by comparing them to TaBERT's across a variety of quantitative and qualitative analyses, including our own pretraining task of corrupt cell classification, as well as embedding clustering and nearest neighbors. Taken as a whole, the analysis suggests that TABBIE better captures fine-grained table semantics.

Corrupt Cell Detection
We first examine how TaBERT performs on TABBIE's pretraining task of corrupt cell detection, which again is practically useful as a postprocessing step after table structure decomposition (Tensmeyer et al., 2019; Raja et al., 2020) because mistakes in predicting row/column/cell boundaries (sometimes compounded by OCR errors) can lead to inaccurate extraction. We fine-tune TaBERT on 100K tables using the MIX corruption strategy for ten epochs, and construct a test set of 10K tables that are unseen by both TaBERT and TABBIE during pretraining. While TABBIE of course sees an order of magnitude more tables for this task during pretraining, this is still a useful experiment to determine whether TaBERT's pretraining objective enables it to easily detect corrupted cells. As shown in Table 5, TaBERT performs significantly worse than TABBIE on all types of corrupt cells (both random corruption and intra-table swaps). Additionally, intra-column swaps are the most difficult for both models: TABBIE achieves 68.8 F1 on this subset compared to just 23.7 F1 for TaBERT. Interestingly, while the MIX strategy consistently performs worse than FREQ for the TABBIE models evaluated on the three downstream tasks in the previous section, it is substantially better at detecting the more challenging corruptions and almost equivalent at detecting random cells sampled by FREQ. This result suggests that more complex table-based tasks may be required to take advantage of representations derived with MIX corruption.

Nearest neighbors
We now turn to a qualitative analysis of the representations learned by TABBIE. In Figure 6 (top), we display the two nearest neighbor columns from our validation set to the date column marked by the red box. TABBIE is able to model the similarity of feb. 16 and saturday. february 5th despite the formatting difference, while TaBERT's neighbors more closely resemble the original column. Figure 6 (bottom) shows that TABBIE's nearest neighbors are less reliant on matching headers than TaBERT's, as the neighbors all have different headers (nom, nombre, name).
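The nearest-neighbor retrieval used in this analysis can be sketched with cosine similarity over column embeddings:

```python
import numpy as np

def nearest_columns(query, bank, k=2):
    # Return the indices of the k bank columns whose [CLSCOL] embeddings
    # have the highest cosine similarity with the query column embedding.
    q = query / np.linalg.norm(query)
    B = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return np.argsort(-(B @ q))[:k]

bank = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
idx = nearest_columns(np.array([1.0, 0.0]), bank, k=2)  # columns 0 and 2
```

In practice the bank would hold the [CLSCOL] embeddings of every column in the validation set, with the query column excluded from its own neighbor list.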

Clustering
Are the embeddings produced by TABBIE useful for clustering and data discovery? To find out, we perform clustering experiments on the FinTabNet dataset from Zheng et al. (2021). This dataset contains ∼110K tables from financial reports of corporations in the S&P 500. We use the [CLS] embedding at the (0, 0) position (i.e., the top left-most cell in the table), extracted from a TABBIE model trained with the FREQ strategy, as a representative embedding for each table in the dataset. Next, we perform k-means clustering on these embeddings using the FAISS library (Johnson et al., 2017), with k=1024 centroids. While the FinTabNet dataset is restricted to the homogeneous domain of financial tables, these tables cluster into sub-types such as consolidated financial tables, jurisdiction tables, insurance tables, etc. We then examine the contents of these clusters (Figure 7) and observe not only that TABBIE embeddings cluster into these sub-types, but also that tables from reports of the same company, but from different financial years, are placed into the same cluster.
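A toy version of this clustering step, with a simple deterministic farthest-point initialization standing in for FAISS (which was used for the actual k=1024 clustering):

```python
import numpy as np

def kmeans(x, k, iters=20):
    # Farthest-point initialization: start from x[0], then repeatedly add
    # the point farthest from all chosen centroids (deterministic).
    idx = [0]
    for _ in range(k - 1):
        d = ((x[:, None, :] - x[idx][None, :, :]) ** 2).sum(-1).min(1)
        idx.append(int(d.argmax()))
    centroids = x[idx].copy()
    for _ in range(iters):  # standard Lloyd iterations
        assign = ((x[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            pts = x[assign == c]
            if len(pts):
                centroids[c] = pts.mean(0)
    return assign, centroids

# Two well-separated groups of table embeddings (toy 2-d stand-ins):
x = np.vstack([np.zeros((5, 2)), 10 * np.ones((5, 2))])
assign, _ = kmeans(x, k=2)
```

The pairwise-distance assignment step here is O(nk) per iteration, which is why an approximate library such as FAISS is preferable at the scale of ∼110K 768-d embeddings.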

Identifying numeric trends
Figure 6: Nearest neighbors of the date and nom columns from the tables on the left, from both TABBIE and TaBERT. TABBIE's nearest neighbors exhibit more diverse formatting and less reliance on the header, which is an example of its semantic representation capability.

Next, we analyze how well TABBIE understands trends in numerical columns by looking at specific examples of our corrupt cell detection task. The first column of the table in Figure 5 contains jersey numbers sorted in ascending order. We swap two cells in this column, 16 and 18, which violates the increasing trend. Both TaBERT (fine-tuned for corrupt cell detection) and TABBIE FREQ struggle to identify this swap, while TABBIE MIX is almost certain that the two cells have been corrupted. This qualitative result is further evidence that the MIX model has potential for more complex table-based reasoning tasks.

Related work
The staggering amount of structured relational data in the form of tables on the Internet has attracted considerable attention from researchers over the past two decades (Cafarella et al., 2008; Limaye et al., 2010; Venetis et al., 2011; Suchanek et al., 2007; Embley et al., 2006), with applications including table retrieval (Das Sarma et al., 2012), schema matching (Madhavan et al., 2001, 2005), and entity linking. Similar to popular large-scale language models pretrained on tasks involving unstructured natural language (Peters et al., 2018; Devlin et al., 2018; Liu et al., 2019), our work is part of a recent trend of self-supervised models trained on structured tabular data. TaBERT (Yin et al., 2020) and TaPaS (Herzig et al., 2020) jointly model tables with text (typically captions or questions), and are thus better suited for tasks like question answering (Pasupat and Liang, 2015). For pretraining, TaBERT attempts to recover the name and datatype of masked column headers (masked column prediction), in addition to the contents of a particular cell (cell value recovery). The pretraining objectives of TaPaS, on the other hand, encourage tabular textual entailment. In concurrent work, the TUTA model (Wang et al., 2020) uses masked language modeling, cell-level cloze prediction, and table-context retrieval as pretraining objectives; in addition to traditional position embeddings, TUTA accounts for the hierarchical nature of tabular data using tree-based positional embeddings. Similarly, Deng et al. (2020) perform a variant of MLM called masked entity recovery. In contrast, TABBIE is pretrained strictly on tabular data, uses corrupt-cell classification as its pretraining task, and is intended for more general-purpose table-based tasks.

Figure 7: TABBIE clusters tables of the type Table of Contents together (first row), and it also places tables of the same type from the same company into the same cluster (second and third rows). We provide the source images of the corresponding tables in this figure.

Conclusion
In this paper, we proposed TABBIE, a self-supervised pretraining method for tables without associated text. To reduce the computational cost of training, we repurpose the ELECTRA objective for corrupt cell detection, and we use two separate Transformers for rows and columns to minimize the complexity associated with sequence length. On three downstream table-based tasks, TABBIE achieves performance competitive with or better than existing methods such as TaBERT, and our analysis reveals that its representations encode a deep semantic understanding of cells, rows, and columns. We publicly release our TABBIE pretrained models and code to facilitate future research on tabular representation learning.

Ethics Statement
As with any research work that involves training large language models, we acknowledge that our work has a negative carbon impact on the environment. A cumulative 1,344 GPU-hours of computation was performed on Tesla V100 GPUs. Total emissions are estimated to be 149.19 kg of CO2 per run of our model (in total, there were two runs). While this is a significant amount (equivalent to ≈17 gallons of fuel consumed by an average motor vehicle 11 ), it is lower than TaBERT's cost per run by more than a factor of 10, assuming a similar computing platform was used. Estimates were computed using the Machine Learning Impact calculator presented in Lacoste et al. (2019).