Incorporating External Knowledge to Enhance Tabular Reasoning

Reasoning about tabular information presents unique challenges to modern NLP approaches which largely rely on pre-trained contextualized embeddings of text. In this paper, we study these challenges through the problem of tabular natural language inference. We propose easy and effective modifications to how information is presented to a model for this task. We show via systematic experiments that these strategies substantially improve tabular inference performance.


Introduction
Natural Language Inference (NLI) is the task of determining if a hypothesis sentence can be inferred as true, false, or undetermined given a premise sentence (Dagan et al., 2013). Contextual sentence embeddings such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), applied to large datasets such as SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), have led to near-human performance of NLI systems.
In this paper, we study the harder problem of reasoning about tabular premises, as instantiated in datasets such as TabFact (Chen et al., 2019) and InfoTabS (Gupta et al., 2020). This problem is similar to standard NLI, but the premises are Wikipedia tables rather than sentences. Models similar to the best ones for the standard NLI datasets struggle with tabular inference. Using the InfoTabS dataset as an example, we present a focused study that investigates (a) the poor performance of existing models, (b) connections to information deficiency in the tabular premises, and, (c) simple yet effective mitigations for these problems.
We use the table and hypotheses in Figure 1 as a running example throughout this paper, and refer to the left column of the table as its keys. 1

H1: NYSE has fewer than 3,000 stocks listed. H2: Over 2,500 stocks are listed in the NYSE. H3: S&P 500 stock trading volume is over $10 trillion.
Figure 1: A tabular premise example. The hypothesis H1 is entailed by it, H2 is a contradiction, and H3 is neutral, i.e., neither entailed nor contradictory.

* The first two authors contributed equally to this work. The first author was a remote intern at the University of Utah during the work.

Tabular inference is challenging for several reasons: (a) poor representation of the tabular premise, (b) missing implicit lexical knowledge needed to reason about hypotheses, (c) the presence of irrelevant, distracting rows, and (d) missing explicit knowledge about the table keys. In the absence of large labeled corpora, any modeling strategy needs to explicitly address these problems. In this paper, we propose effective approaches for addressing them, and show that they lead to substantial improvements in prediction quality, especially on adversarial test sets. This focused study makes the following contributions:
1. We analyse why existing state-of-the-art BERT-class models struggle on the challenging task of NLI over tabular data.
2. We propose solutions to overcome these challenges via simple modifications to inputs using existing language resources.
3. Through extensive experiments, we show significant improvements to model performance, especially on challenging adversarial test sets. The updated dataset, along with associated scripts, is available at https://github.com/utahnlp/knowledge_infotabs.

Challenges and Proposed Solutions
We examine the issues highlighted in §1 and propose simple solutions to mitigate them below.
Better Paragraph Representation (BPR): One way to represent the premise table is to use a universal template to convert each row of the table into a sentence, which then serves as input to a BERT-style model. Gupta et al. (2020) suggest that in a table titled t, a row with key k and value v should be converted to a sentence using the template: "The k of t are v." Despite the advantage of simplicity, the approach produces ungrammatical sentences. In our example, the template converts the Founded row to the sentence "The Founded of New York Stock Exchange are May 17, 1792; 226 years ago.".
We note that keys are associated with values of specific entity types such as MONEY, DATE, CARDINAL, and BOOL, and the entire table itself has a category. Therefore, we propose type-specific templates instead of using the universal one. 2 In our example, the table category is Organization and the key Founded has the type DATE. A better template for this key is "t was k on v", which produces the more grammatical sentence "New York Stock Exchange was Founded on May 17, 1792; 226 years ago.". Furthermore, we observe that including the table category information, i.e., the sentence "New York Stock Exchange is an Organization.", helps the model better understand the premise context. 3 Appendix A provides more such templates.
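For concreteness, a minimal Python sketch of this row-to-sentence conversion is shown below. Only the universal template, the DATE template for the Organization category, and the category sentence come from the example above; the other template entries, the function names, and the article handling are illustrative assumptions rather than the full template set of Appendix A.

```python
# Minimal sketch of type-specific row-to-sentence templates (BPR).
# Only the (ORGANIZATION, DATE) entry is taken from the example in the text;
# the remaining entries are illustrative placeholders.
UNIVERSAL = "The {key} of {title} are {value}."

TYPE_TEMPLATES = {
    ("ORGANIZATION", "DATE"): "{title} was {key} on {value}.",      # from the example
    ("ORGANIZATION", "MONEY"): "The {key} of {title} is {value}.",  # illustrative
    ("PERSON", "DATE"): "{title} was {key} on {value}.",            # illustrative
}

def row_to_sentence(title, category, key, value, value_type):
    """Convert one table row to a sentence, preferring a type-specific template."""
    template = TYPE_TEMPLATES.get((category, value_type), UNIVERSAL)
    return template.format(title=title, key=key, value=value)

def table_to_paragraph(title, category, rows):
    """Prepend the table-category sentence, then add one sentence per row."""
    article = "an" if category[0].lower() in "aeiou" else "a"
    sentences = [f"{title} is {article} {category.capitalize()}."]
    for key, value, value_type in rows:
        sentences.append(row_to_sentence(title, category, key, value, value_type))
    return " ".join(sentences)

# Running example from the paper:
print(table_to_paragraph(
    "New York Stock Exchange", "ORGANIZATION",
    [("Founded", "May 17, 1792; 226 years ago", "DATE")]))
```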

Implicit Knowledge Addition (KG implicit):
Tables represent information implicitly; they do not employ connectives to link their cells. As a result, a model trained only on tables struggles to make lexical inferences about the hypothesis, such as the difference between the meanings of 'before' and 'after', and the function of negations. This is surprising, because the models have the benefit of being pre-trained on large textual corpora.
Recently, Andreas (2020) and Pruksachatkun et al. (2020) showed that we can pre-train models on specific tasks to incorporate such implicit knowledge. Eisenschlos et al. (2020) use pre-training on synthetic data to improve performance on the TabFact dataset. Inspired by these, we first train our model on the large, diverse and human-written MultiNLI dataset, and then fine-tune it on the InfoTabS task. Pre-training with MultiNLI data exposes the model to diverse lexical constructions. Furthermore, it increases the training data size by the 433K MultiNLI example pairs. This makes the representation better tuned to the NLI task, thereby leading to better generalization.
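A rough sketch of this two-stage training recipe, written against the Hugging Face Transformers and Datasets libraries, is given below; the hyperparameter values, the checkpoint paths, and the InfoTabS loading step (file name and field names) are placeholder assumptions, not the released training scripts.

```python
# Sketch: MultiNLI pre-training followed by InfoTabS fine-tuning with RoBERTa-large.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=3)

def tokenize(batch):
    # Premise paragraph and hypothesis encoded as a standard sentence pair.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, padding="max_length", max_length=512)

def train_stage(model, dataset, output_dir):
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=1e-5)
    Trainer(model=model, args=args, train_dataset=dataset).train()
    return model

# Stage 1 (implicit knowledge): train on the 433K MultiNLI pairs.
mnli = load_dataset("multi_nli", split="train").map(tokenize, batched=True)
model = train_stage(model, mnli, "ckpt/mnli")

# Stage 2: fine-tune the same weights on InfoTabS premise paragraphs.
# File name and JSON fields (premise/hypothesis/label) are placeholders.
infotabs = load_dataset("json", data_files="infotabs_train_bpr.json", split="train")
model = train_stage(model, infotabs.map(tokenize, batched=True), "ckpt/infotabs")
```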
Distracting Rows Removal (DRR): Not all premise table rows are necessary to reason about a given hypothesis. In our example, for the hypotheses H1 and H2, the row corresponding to the key No. of listings is sufficient to decide the label for the hypothesis. The other rows are an irrelevant distraction. Further, as a practical concern, when longer tables are encoded into sentences as described above, the resulting number of tokens exceeds the input size restrictions of existing models, leading to useful rows potentially being cropped. Appendix F shows one such example from InfoTabS. Therefore, it becomes important to prune irrelevant rows.
To identify relevant rows, we employ a simplified version of the alignment algorithm used by Yadav et al. (2019, 2020) for retrieval in reading comprehension. First, every word in the hypothesis sentence is aligned with the most similar word in the table sentences using cosine similarity. We use fastText (Joulin et al., 2016; Mikolov et al., 2018) embeddings for this purpose, which preliminary experiments revealed to be better than other embeddings. Then, we rank rows by their similarity to the hypothesis, aggregating similarity over the content words in the hypothesis. Yadav et al. (2019) used inverse document frequency for weighting words, but we found that simple stop word pruning was sufficient. We take the top k rows by similarity as the pruned representative of the table for this hypothesis. The hyper-parameter k is selected by tuning on a development set. Appendix B gives more details about these design choices.
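A minimal sketch of this row-ranking procedure is shown below, assuming pretrained fastText vectors (e.g., the cc.en.300.bin model); the whitespace tokenization and the short stop-word list are simplifications introduced for illustration.

```python
# Sketch of distracting rows removal (DRR): rank row sentences by word-level
# alignment with the hypothesis and keep the top-k rows.
import numpy as np
import fasttext

ft = fasttext.load_model("cc.en.300.bin")  # pretrained English fastText vectors
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in", "on",
              "to", "than", "has", "have", "and", "or", "by", "for", "it"}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def row_score(hypothesis, row_sentence):
    """Align each hypothesis content word with its most similar word in the row
    sentence (binary weighting: stop words contribute 0), then sum the maxima."""
    row_vecs = [ft.get_word_vector(w) for w in row_sentence.lower().split()]
    score = 0.0
    for word in hypothesis.lower().split():
        if word in STOP_WORDS:
            continue
        word_vec = ft.get_word_vector(word)
        score += max(cosine(word_vec, r) for r in row_vecs)
    return score

def prune_rows(hypothesis, row_sentences, k=4):
    """Keep the k rows most similar to the hypothesis (k tuned on the dev set)."""
    ranked = sorted(row_sentences, key=lambda s: row_score(hypothesis, s), reverse=True)
    return ranked[:k]
```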

Explicit Knowledge Addition (KG explicit):
We found that adding explicit information to enrich keys improves a model's ability to disambiguate and understand them. We expand the pruned table premises with contextually relevant key information from existing resources such as WordNet (definitions) or Wikipedia (first sentence, usually a definition). 4 To find the best expansion of a key, we use the sentential form of a row to obtain the BERT embedding (on-the-fly) for its key. We also obtain the BERT embeddings of the same key from WordNet examples (or Wikipedia sentences). 5 Finally, we concatenate to the table the WordNet definition (or the Wikipedia sentence) corresponding to the highest key embedding similarity. As we want the contextually relevant definition of the key, we use BERT embeddings rather than non-contextual ones (e.g., fastText). For example, the key volume can have different meanings in different contexts. For our example, the contextually best definition is "In capital markets, volume is the total number of a security that was traded during a given period of time." rather than the other definition "In thermodynamics, the volume of a system is an extensive parameter for describing its thermodynamic state.".
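The sketch below illustrates the definition-selection step for the WordNet path only (the Wikipedia path is analogous); the choice of bert-base-uncased, the span matching, and the restriction to single-word keys are simplifying assumptions made for illustration.

```python
# Sketch: pick the WordNet definition whose example usage is closest (by BERT
# contextual similarity) to the key as it appears in the row sentence.
# Requires: nltk.download("wordnet")
import torch
from nltk.corpus import wordnet as wn
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def key_embedding(sentence, key):
    """Contextual embedding of `key` inside `sentence`: mean of the last-layer
    vectors of the subword tokens overlapping the key span. Assumes the key
    occurs in the sentence (true by construction of the BPR sentences)."""
    enc = tok(sentence, return_tensors="pt", return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    start = sentence.lower().find(key.lower())
    end = start + len(key)
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]
    span = [i for i, (s, e) in enumerate(offsets) if s < end and e > start and e > s]
    return hidden[span].mean(dim=0)

def best_definition(key, row_sentence):
    """Return the WordNet definition of `key` (single word) whose example usage
    best matches the key's embedding in the row sentence; None if no match."""
    anchor = key_embedding(row_sentence, key)
    best, best_sim = None, -1.0
    for synset in wn.synsets(key):
        for example in synset.examples():
            if key.lower() not in example.lower():
                continue
            sim = torch.cosine_similarity(anchor, key_embedding(example, key), dim=0).item()
            if sim > best_sim:
                best, best_sim = synset.definition(), sim
    return best  # concatenated to the premise paragraph when not None
```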

Experiment and Analysis
Our experiments are designed to study the research question: Can today's large pre-trained models exploit the information sources described in §2 to better reason about tabular information?

Experimental setup
Datasets Our experiments use InfoTabS, a tabular inference dataset from Gupta et al. (2020). The dataset is heterogeneous in the types of tables and keys, and relies on background knowledge and common sense. Unlike the TabFact dataset (Chen et al., 2019), it has all three inference labels, namely entailment, contradiction and neutral. Importantly, for the purpose of our evaluation, it has three test sets. In addition to the usual development set and the test set (called α 1 ), the dataset has two adversarial test sets: a contrast set α 2 that is lexically similar to α 1 , but with minimal changes in the hypotheses that flip the entail-contradict labels, and a zero-shot set α 3 which has long tables from different domains with little key overlap with the training set.
Models For a fair comparison with earlier baselines, we use RoBERTa-large (RoBERTa L ) for all our experiments. We represent the premise table by converting each table row into a sentence, and then appending them into a paragraph, i.e. the Para representation of Gupta et al. (2020).
Hyperparameter Settings 6 For the distracting row removal (+DRR) step, we have a hyperparameter k.
We experimented with k ∈ {2, 3, 4, 5, 6} by predicting on the +DRR development premises with a model trained on the original training set (i.e., BPR), as shown in Table 1. The development accuracy increases significantly as k increases from 2 to 4, and then increases only marginally from 4 to 6 (about a 1.5% improvement). Since our goal is to remove distracting rows, we use the smallest value with good performance, i.e., k = 4. 7
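Schematically, the sweep looks like the sketch below, which reuses the prune_rows function from the DRR sketch in §2; model_predict and dev_examples are hypothetical placeholders for the BPR-trained model's prediction function and the development data.

```python
def dev_accuracy_for_k(model_predict, dev_examples, k):
    """Accuracy of the BPR-trained model when each dev premise is pruned to k rows.
    Each dev example is assumed to be (row_sentences, hypothesis, gold_label)."""
    correct = 0
    for row_sentences, hypothesis, gold_label in dev_examples:
        premise = " ".join(prune_rows(hypothesis, row_sentences, k=k))
        correct += int(model_predict(premise, hypothesis) == gold_label)
    return correct / len(dev_examples)

# Usage (with placeholder model_predict / dev_examples):
# for k in (2, 3, 4, 5, 6):
#     print(k, dev_accuracy_for_k(model_predict, dev_examples, k))
```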
Train  Dev    k = 2   k = 3   k = 4   k = 5   k = 6
BPR    DRR    71.72   74.83   77.50   78.50   79.00

Table 1: Dev accuracy on increasing hyperparameter k.

Table 2: Accuracy with the proposed modifications on the Dev and test sets. Here, + represents the change with respect to the previous row. Reported numbers are the average over three random seed runs, with standard deviations of 0.33 (+KG explicit), 0.46 (+DRR), 0.61 (+KG implicit), and 0.86 (BPR) over all sets. All improvements are statistically significant with p < 0.05, except α 1 for the BPR representation w.r.t. Para (Original). The Human and Para results are taken from Gupta et al. (2020).

BPR As shown in Table 2, with BPR, we observe that the RoBERTa L model improves performance on all dev and test sets except α 3 . There are two main reasons behind this poor performance on α 3 .
First, the zero-shot α 3 data includes unseen keys. The number of keys common to α 3 and the training set is 94, whereas for dev, α 1 and α 2 it is 334, 312, and 273 respectively (i.e., 3-5 times more). Second, despite being represented by better sentences, some relevant rows are still ignored due to the input size restriction of RoBERTa L .
KG implicit We observe that implicit knowledge addition via MNLI pre-training helps the model reason and generalize better. From Table 2, we can see significant performance improvement in the dev and all three test sets.
DRR This leads to a significant improvement on the α 3 set. We attribute this to two primary reasons. First, α 3 tables are longer (13.1 keys per table on average, vs. 8.8 keys on average in the others), and DRR is important to avoid automatically dropping keys from the bottom of a table due to the RoBERTa L model's input size limit. Without these relevant rows, the model incorrectly predicts the neutral label. Second, α 3 is a zero-shot dataset and has a significant proportion of unseen keys, which could end up being noise for the model. The slight decrease in performance on the dev, α 1 and α 2 sets can be attributed to the model utilising spurious patterns over irrelevant keys for prediction. 8 We validated this experimentally by testing the model trained on the original premises on the DRR test tables. Table 5 in Appendix C shows that without pruning, the model focuses on irrelevant rows for prediction.
KG explicit With explicit contextualized knowledge about the table keys, we observe a marginal improvement on the dev and α 1 test sets and a significant performance gain on the α 2 and α 3 test sets. The improvement on the α 3 set shows that adding external knowledge helps in the zero-shot setting. With α 2 , the model cannot utilise spurious lexical correlations 9 due to its adversarial nature, and is forced to use the relevant keys in the premise tables; thus, adding explicit information about the keys improves performance more for α 2 than for α 1 or dev. Appendix F shows some qualitative examples.
8 The performance drop on dev, α 1 and α 2 is also marginal, i.e., dev: 79.57 to 78.77, α 1 : 78.27 to 78.13, α 2 : 71.87 to 70.90, as compared to InfoTabS WMD-top3, i.e., dev: 75.5 to 72.55, α 1 : 74.88 to 70.38, α 2 : 65.44 to 62.55; the WMD-top3 numbers are taken from Gupta et al. (2020).
9 The hypothesis-only baseline for α 2 is 48.5%, vs. α 1 : 60.5% and dev: 60.5% (Gupta et al., 2020).

Ablation Study
We perform an ablation study, as shown in Table 3, where instead of applying all modifications sequentially one after another (+), we apply only one modification at a time to analyze its effect.
Through our ablation study we observe that: (a) DRR improves performance on the dev, α 1 , and α 2 sets, but slightly degrades it on the α 3 set. The drop in performance on α 3 is due to spurious artifact deletion, as explained in detail in Appendix E. (b) KG explicit gives a performance improvement on all sets. Furthermore, there is a significant boost in performance on the adversarial α 2 and α 3 sets. 10 (c) Similarly, KG implicit shows significant improvement on all test sets. The large improvements on the adversarial α 2 and α 3 sets suggest that the model can now reason better. Although implicit knowledge provides the largest performance gain, all modifications are needed to obtain the best performance on all sets (especially on the α 3 set).
The proposed modifications in this work are simple and intuitive. Yet, existing table reasoning papers have not studied the impact of such input modifications. Furthermore, much of the recent work focuses on building sophisticated neural models, without explicit focus on how these models (designed for raw text) adapt to the tabular data. In this work, we argue that instead of relying on the neural network to "magically" work for tabular structures, we should carefully think about the representation of semi-structured data, and the incorporation of both implicit and explicit knowledge into neural models. Our work highlights that simple pre-processing steps are important, especially for better generalization, as evident from the significant improvement in performance on adversarial test sets with the same RoBERTa models. We recommend that these pre-processing steps should be standardized across table reasoning tasks.

Conclusion & Future Work
We introduced simple and effective modifications that rely on introducing additional knowledge to improve tabular NLI. These modifications govern what information is provided to a tabular NLI model and how that information is presented to it. We presented a case study with the recently published InfoTabS dataset and showed that our proposed changes lead to significant improvements. Furthermore, we carefully studied the effect of these modifications on the multiple test sets, and why a certain modification helps a particular adversarial set.
We believe that our study and proposed solutions will be valuable to researchers working on question answering and generation problems involving both tabular and textual inputs, such as tabular/hybrid question answering and table-to-text generation, especially with difficult or adversarial evaluation. Looking ahead, our work can be extended to include explicit knowledge for hypothesis tokens as well. To increase robustness, we can also integrate structural constraints via data augmentation during NLI training. Moreover, we expect that structural information such as position encoding could also help to better represent tables.

A BPR Templates
Here we list some of the diverse example templates we framed.
• For keys of type DATE in the Organization category, we use the template "t was k on v.", e.g., "New York Stock Exchange was Founded on May 17, 1792; 226 years ago."

B DRR Design Choices
fastText embeddings: Yadav et al. (2019) used BERT and GloVe embeddings. In our case, we prefer fastText word embeddings over GloVe because fastText uses sub-word information, which helps capture different variations of the context words. Furthermore, fastText embeddings are also a better choice than BERT for our task because: 1. we embed single sentential forms of diverse rows rather than longer, contextually similar paragraphs; 2. all words (especially keys) of the rows across all the tables are used in only one context, whereas BERT is useful when the same word is used in different contexts across paragraphs; 3. in every table, the number of sentences to select from is bounded by the maximum number of rows in the table, which is a small number (8.8 on average in train, dev, α 1 , α 2 and 13.1 in α 3 ); and 4. fastText is much faster than BERT for obtaining embeddings.
Binary weighting: Since we embed single sentential forms of diverse rows instead of longer, context-related paragraphs, we found that binary weighting (0 for stop words and 1 for all other words) is more effective than idf weighting, which is useful mainly for longer paragraph contexts with several lexical terms.
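For illustration, the two weighting schemes could be written as below; with either scheme, the per-word alignment score in the row ranking is multiplied by the word's weight before summation. The function names are ours, not from the released scripts.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and", "or"}

def idf_weights(hypotheses):
    """Inverse document frequency over a collection of hypotheses (Yadav et al. style)."""
    n = len(hypotheses)
    doc_freq = Counter(w for h in hypotheses for w in set(h.lower().split()))
    return {w: math.log(n / doc_freq[w]) for w in doc_freq}

def binary_weight(word):
    """Weighting used here: 0 for stop words, 1 for every other word."""
    return 0.0 if word.lower() in STOP_WORDS else 1.0
```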

C Hyperparameter k vs. test-set accuracy
We also trained and tested a model on the DRR table premises for increasing values of the hyperparameter k, as shown in Table 1. We also test the model trained on the entire paragraph on pruned paragraphs with increasing values of the hyperparameter k ∈ {2, 3, 4, 5, 6} for the test sets α 1 , α 2 , and α 3 . In all cases except α 3 , the performance with larger k is better. The increase in performance even with k > 4 shows that the model is using more than the required keys for prediction. Thus, the model is utilising spurious patterns in irrelevant rows for prediction.

E Artifacts and Model Predictions
In Table 7 we show the percentage of examples which were predicted correctly after the modifications and vice versa. Surprisingly, there is a small percentage of examples which were predicted correctly with the original premise (Para) but are predicted wrongly after all the modifications (Mod), although such examples are far fewer than the opposite case. We suspect that the earlier model was also relying on spurious patterns (artifacts) for its correct predictions on these examples, and these patterns are disrupted by the proposed modifications. Hence, the new model struggles to predict correctly on such examples. In Appendix F, we show qualitative examples where the modifications help the model predict correctly. We also provide some examples where the model fails after the distracting row removal modification.

F Qualitative Examples
In this section, we provide examples where the model predicts correctly after the proposed modifications. We also provide some examples where the model struggles to make the correct prediction after the distracting row removal (DRR) modification.

Hypothesis: Eva Mendes has two children.

Premise                 Label
Human Label (Gold)      Entailed
Original Premise        Neutral
+BPR                    Entailed

Result and Explanation: In this example from α 2 , the model predicts Neutral for this hypothesis with the original premise. However, forming better sentences by adding "number of children are 2" (highlighted in green) for the CARDINAL type in the PERSON category helps the model understand the relation between the children and the number two, and arrive at the correct prediction of entailment.

Premise: (m. 1942; annulled 1942), Stanley Reames (m. 1945, div. 1949), Tony Curtis (m. 1951, div. 1962), Robert Brandt (m. 1962

Result and Explanation: In this example from α 2 , both the model without implicit knowledge and the model with implicit knowledge addition predict the correct label for Hypothesis A. However, for Hypothesis B, which is an α 2 example originally generated by replacing the word "over" with the word "under" in Hypothesis A and flipping the gold label from entail to contradiction, the earlier model, which uses artifacts over lexical patterns, predicts the original, wrong label entail instead of contradiction. On adding implicit knowledge during training, the model is able to reason rather than relying on artifacts, and correctly predicts contradiction. Note that Hypothesis A and Hypothesis B require exactly the same reasoning for inference, i.e., they are equally hard.

Premise: The oxidation states of Fluorine is -1 (oxidizes oxygen).

Result and Explanation: In this example from the α 3 set, removing the distracting rows (every sentence except those highlighted in green and blue) helps, since they are irrelevant distracting noise and also make the premise paragraph longer than the model's maximum tokenization limit. Before DRR is applied, the model predicts neutral due to (a) the distracting rows, and (b) the required information, i.e., the relevant rows highlighted in green, being removed because of the maximum tokenization limit (it is the second-to-last sentence). After DRR, the pruned premise retains only the relevant keys highlighted in green, and thus the model is able to predict the correct label.

Negative Example
In some examples, the distracting row removal (DRR) removes relevant rows, and hence the model fails to predict correctly on the DRR premise, as shown below:

Original Premise: Et in Arcadia ego is a painting. Et in Arcadia ego is also known as Les Bergers d'Arcadie.

Table 13: Prediction after DRR. Here, + represents the change with respect to the previous row.

Result and Explanation: In this example from the Dev set, the model before DRR predicts the correct label, but after DRR it predicts the incorrect label of neutral, despite the fact that both relevant rows required for inference (highlighted in green) are present after DRR. This shows that the model was looking at more keys than required in the initial case; these are eliminated by DRR, which forces the model to change its prediction. Thus, the model was utilising spurious correlations from irrelevant rows to predict the label.

Original Premise
Spouse is defined as a spouse is a significant other in a marriage, civil union, or common-law marriage.

Hypothesis: Julius Caesar was buried in Rome.

Model                   Label
Human Label (Gold)      Entailed
Original Premise        Neutral
+ KG explicit           Entailed

Result and Explanation: In this example from α 2 , the model without explicit knowledge predicts neutral for the hypothesis, as it is not able to infer that a resting place is where people are buried; it implicitly lacks an understanding connecting "buried" to the relevant key. With explicit KG addition (highlighted in blue and green), we add the definition of resting place as the place where the remains of the dead are buried (highlighted in green). The model now uses this extra information (highlighted in green) together with the original key related to death (highlighted in bold) to correctly infer that the statement "Caesar is buried in Rome" is entailed.