A Graph Representation of Semi-structured Data for Web Question Answering

The abundant semi-structured data on the Web, such as HTML-based tables and lists, provide commercial search engines with a rich information source for question answering (QA). Unlike plain text passages in Web documents, Web tables and lists have inherent structures, which carry semantic correlations among their various elements. Many existing studies treat tables and lists as flat documents with pieces of text and do not make good use of the semantic information hidden in those structures. In this paper, we propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations. We also develop pre-training and reasoning techniques on the graph model for the QA task. Extensive experiments on several real datasets collected from a commercial search engine verify the effectiveness of our approach. Our method improves the F1 score by 3.90 points over the state-of-the-art baselines.


Introduction
Question answering (QA) has become an important feature in most search engines, as it delivers information to users in an effective and easy-to-understand manner. Answers to questions are often extracted from Web tables and lists. For example, Figure 1 shows the search result page (SERP) for questions Q1 "cities with the highest GDP in the world" and Q2 "the best skydiving locations in the world", where the answer to Q1 comes from a Web table, while that to Q2 comes from a Web list.
Compared to unstructured plain text, semi-structured Web data, such as Web tables and lists, represent rich relational information more effectively. Relations among the various elements in a Web table or list may be useful in answering user questions. According to statistics from a global commercial search engine, there are hundreds of millions of semi-structured data pieces, including tables and lists, on the Web, and the intents of 30% of user queries can be answered by semi-structured data.
Previous attempts at question answering (QA) using semi-structured data on the Web are mainly IR-based approaches (Balakrishnan et al., 2015; Chakrabarti et al., 2020). Typically, those methods convert semi-structured data into documents by sequentially rearranging text cells to fit language models (Chakrabarti et al., 2020; Wang et al., 2018; Zhang and Balog, 2018). Those studies do not make use of the inherent structural relationships among the components of Web tables or lists. For example, the rearrangement does not consider the vertical relations among cells located in the same column, such as the relation among "New York", "Tokyo" and "Los Angeles" in Figure 1(a).
Some recent studies leverage tabular structure implicitly. For example, Nishida et al. (2017) cast tables as matrices of text and apply convolutional neural networks for table embedding. Zhang and Balog (2018) cut tables into smaller fragments. However, the structural information in Web tables and lists is more complex than the simple adjacency relation of matrices. How to take the best advantage of both the text information and the structural relations in Web tables and lists in QA remains a challenge not thoroughly explored.
In this paper, we tackle the problem of Web QA over semi-structured data and make the following contributions. First, in Section 2, we systematically categorize the different components in semi-structured Web data, including captions, headers, subject columns, attribute columns, and cells, as well as their relations, including the cell-cell relation, header-cell relation, subject-attribute relation, and caption-content relation. In Section 3, we propose GraSSLM, a graph model that jointly represents both text and structural information in semi-structured data for Web QA. Our GraSSLM model explicitly represents the different types of components as nodes in a graph and their relations as edges. It integrates heterogeneous information, including text and structures, effectively, and naturally reveals hidden semantic correlations across components.

Figure 1: Examples of a Web table (a) and a Web list (b) from a commercial search engine. The two queries are non-factoid queries, which can be answered by information from semi-structured data such as tables and lists.
Second, in Section 3, we apply two pre-training techniques for graph models. In particular, we design a novel neighbor prediction objective (NPO) to leverage graph structure in node embedding. This pre-training task requires the model to predict the entire content of a masked node from the unmasked neighbor nodes, which guides the model to learn attention over the context. The attention is used in the graph reasoning stage, where the representation of each node is updated with aggregated information from its neighbor nodes. In this way, the inherent semantic correlations among neighbor nodes are propagated via the structural connections in the graph. Compared with previous methods, this graph pre-training and reasoning mechanism better exploits the structural information in tables and lists.
Last, to compare our model with the state-of-the-art methods, we create new datasets that contain real-world queries collected from a global commercial search engine, paired with table and list data mined from the Web. Each (query, table/list) pair is further labeled for relevance by crowdsourcing annotators with consensus. The experimental results on the test sets, reported in Section 4, show that GraSSLM outperforms the best state-of-the-art baselines by up to 1.77 points in average F1 score.

Problem Statement
Let S be a semi-structured data example, either a Web table T or a Web list L. There are different types of tables and lists, for example, relational tables, entity tables, matrix tables, enumerate lists and group lists (Lautert et al., 2013). Our method is generally applicable to all those types; therefore, we do not distinguish them in this paper.

Components of Web Semi-structured Data
Following Crestan and Pantel (2011) and Eberius et al. (2015), we divide a Web semi-structured data example into various components, as illustrated in Figure 2.
A caption C is a direct description that usually adjoins the content body of the semi-structured data. For example, in Figure 1, "Top 10 Cities by Projected GDP" and "30 Jaw Dropping Places for Skydiving in the world!" are the captions of the table and the list, respectively.
Data content refers to the body of semi-structured data, which consists of multiple rows and columns. A special row, the header, is often located at the top of the table. The elements in the header often describe the classes that the content of the table belongs to. For example, in Figure 2, the first row, consisting of "Rank", "City", and "Country", is the header of the table. The elements in the remaining rows of the table are cells. Vertically, cells are grouped into columns, where we identify subject columns and attribute columns. Subject columns refer to one or more key subjects or entities of the table, while attribute columns list the attribute information of the corresponding subjects or entities. In Figure 2, the column "City" is a subject column, while the columns "Rank" and "Country" are both attribute columns. To recognize subject columns, we adopt a heuristic method (Nishida et al., 2017), which calculates the ratio of distinct strings in each column as a signal for subject classification. Our empirical study finds that this simple method achieves an accuracy of over 95%. In addition, a schema classification method (Eberius et al., 2015) is applied to detect vertical Web tables and transpose them into horizontal ones.
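The distinct-string-ratio heuristic can be sketched as follows; note that the numeric-column filter and the tie-breaking rule are illustrative assumptions for this sketch, and the exact rules in Nishida et al. (2017) may differ.

```python
import re

def subject_column_index(columns):
    """Return the index of the likely subject column.

    `columns` is a list of columns, each a list of cell strings. Sketch of
    the distinct-string-ratio heuristic; the numeric filter and tie-breaking
    are assumptions made for illustration.
    """
    def looks_numeric(cell):
        # Ranks, counts and percentages ("#3", "12", "4.5%") rarely name subjects.
        return bool(re.fullmatch(r"#?\d+(\.\d+)?%?", cell.strip()))

    def distinct_ratio(col):
        values = [c.strip().lower() for c in col if c.strip()]
        if not values or all(looks_numeric(c) for c in col):
            return 0.0
        # Subject columns tend to hold unique entity names, so a high
        # ratio of distinct values is a strong signal.
        return len(set(values)) / len(values)

    return max(range(len(columns)), key=lambda j: distinct_ratio(columns[j]))
```

On the table of Figure 1(a), the "Rank" column is filtered out as numeric and "City" wins over "Country" because all of its values are distinct.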
A list can be regarded as a special type of table that has only a single column and no header.

Relations among Components in Tables and Lists
Different components in tables and lists bear inherent semantic relations. Modeling those relations in a graph fuses the semantics of the components and achieves a rich representation of semi-structured data. In particular, we are interested in the following four types of relations.
Caption-Content Relation. A caption is often a summary of the context and content of a table or a list. The words in a caption are often reliable evidence for determining the relevance between a query and a semi-structured data example.
Header-Cell Relation. Since a header often outlines the classes that the cells belong to, a header-cell relation is usually a class-instance relation. For example, the cell, "Los Angeles" in Figure 1 is an instance of the class "City".
Subject-Attribute Relation. More often than not, tables store entity information. In such a table, each row, except for the header, corresponds to one entity: the cells in the subject columns contain the entity names, and the remaining cells in the attribute columns hold the attributes of that entity. For example, in Figure 1(a), the third row corresponds to a "City" entity, "Los Angeles", and "#3" and "United States" are the values of its attributes "Rank" and "Country", respectively. The subject-attribute relation is usually an entity-attribute relation.

Figure 3: GraSSLM contextually encodes the query and the semi-structured data via a pre-trained transformer model. It then builds a graph over the semi-structured data and updates the node embeddings with the graph reasoning module. Finally, the graph classification module aggregates the node information to predict the QA match score. In addition, GraSSLM applies two graph pre-training objectives to encourage the model to learn attention over contextual nodes.
Cell-Cell Relation. If we ignore the subject columns, the remaining cells within the same rows or columns are also semantically related. The cells in the same row often describe the various attributes of the same entity, while the cells in the same column are often instances of the same class.
As mentioned in Section 2.1, lists can be considered a special type of table. They only have the Caption-Content relation and the Cell-Cell relation.
The problem of QA over semi-structured data is: given a query Q and a semi-structured data example S, return the QA match score d(Q, S), which predicts the likelihood that S answers Q.

Method
In this section, we propose GraSSLM, a graph model of semi-structured data on the Web for QA. Figure 3 shows the overall structure of GraSSLM.
GraSSLM is composed of three components. First, a pre-trained language model generates token-level contextual embeddings for the concatenation of an input query Q and a semi-structured data example S. Second, a graph construction module converts the initial plain-text embeddings into a graph. Last, a graph reasoning and classification module predicts the matching result.

Graph Construction
Given a query Q and a semi-structured data example S of M rows and N columns, we construct a graph based on the components and their relations described in Section 2; Figure 3 illustrates the construction. The edges in the graph are created as follows. The first group of edges is formed from the structural relations in the semi-structured data example. These structural relations carry the inherent semantic relations between the components and define the context needed to better represent the elements in a semi-structured data example, through the following four types of edges. (i) Caption-Content Relation: edges between caption nodes and cell nodes. (ii) Header-Cell Relation: edges between header nodes and the cell nodes in the corresponding columns. (iii) Subject-Attribute Relation: edges between subject cell nodes and the attribute cell nodes in the corresponding rows. (iv) Cell-Cell Relation: edges between neighboring cell nodes in the same row or the same column.
The second group of edges connects the query and the semi-structured data example by linking the query node N_Q with all nodes in S, including edges between the query node and cell nodes, as well as edges between the query node and caption nodes. The weights of these edges are derived in the graph reasoning stage to represent the bi-directional attention between the query words and the data components.
The graph for a list is a simplified version of that for a table, with no Header-Cell edges or Subject-Attribute edges.
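A minimal sketch of both groups of edges follows. The node layout is an assumption chosen to match K = 2 + M × N (Section 3.3): node 0 is the query, node 1 the caption, and cell (i, j), with row 0 being the header row, maps to node 2 + i·N + j.

```python
def build_table_edges(M, N, subject_col):
    """Build the undirected edge list for an M-row x N-column table graph.

    Sketch of the construction in Section 3.1 under an assumed node layout:
    0 = query node, 1 = caption node, cell (i, j) -> 2 + i*N + j, where
    row 0 is the header row. Edges are stored as sorted (a, b) pairs.
    """
    def node(i, j):
        return 2 + i * N + j

    edges = set()
    def add(a, b):
        edges.add((min(a, b), max(a, b)))

    for i in range(M):
        for j in range(N):
            if i > 0:
                add(1, node(i, j))                    # (i) Caption-Content
                add(node(0, j), node(i, j))           # (ii) Header-Cell (same column)
                if j != subject_col:                  # (iii) Subject-Attribute (same row)
                    add(node(i, subject_col), node(i, j))
            if j + 1 < N:                             # (iv) Cell-Cell, row neighbors
                add(node(i, j), node(i, j + 1))
            if i + 1 < M:                             # (iv) Cell-Cell, column neighbors
                add(node(i, j), node(i + 1, j))
            add(0, node(i, j))                        # query <-> every node in S
    add(0, 1)                                         # query <-> caption
    return sorted(edges)
```

For a list, the same routine applies with N = 1 and the Header-Cell / Subject-Attribute branches disabled.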

Graph Initialization
To obtain the initial representation of graph nodes, we first concatenate the query Q and the text in the semi-structured data example S. The concatenated string is G = (Q, C, {h_j}, {c_ij}), where C is the caption, {h_j} are the tokens in the header, and {c_ij} are the cells in S. We feed G into a pre-trained BERT model (Devlin et al., 2018) and derive a contextual embedding for each token in G, denoted LM(G). In this paper, BERT-base is used for contextual embedding.
We further derive the initial representation for each node in the graph. Since different nodes may contain token spans of various lengths, we adopt the method of Fang et al. (2019), which applies a BiLSTM (Chen et al., 2017) on top of the transformer output and a multi-layer perceptron (MLP) to convert token spans of various lengths into a fixed-size vector as the node representation. We write the BiLSTM model as a function B, denote by B(LM(G)) the model applied on top of the transformer output, and by B(LM(G))[s; t] the hidden states of the model for the span extremes at positions s and t. We use the subscripts start and end to denote the start and end positions of the tokens of the corresponding components. The initial representations of the nodes are as follows, where normal fonts are used for the text of the corresponding nodes and bold fonts for the embeddings: Q = MLP(B(LM(G))[Q_start; Q_end]), C = MLP(B(LM(G))[C_start; C_end]), h_j = MLP(B(LM(G))[h_j,start; h_j,end]), and c_ij = MLP(B(LM(G))[c_ij,start; c_ij,end]).
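The span-to-vector step can be sketched as below. Here H stands for the matrix of Bi-LSTM hidden states B(LM(G)), and we assume, for illustration, that [s; t] means concatenating the two endpoint states and that the MLP is a single ReLU layer; the exact pooling in Fang et al. (2019) may differ.

```python
import numpy as np

def node_embedding(H, s, t, W, b):
    """Convert a token span into a fixed-size node vector (Section 3.2 sketch).

    H : (seq_len, 2h) Bi-LSTM hidden states computed on the transformer output.
    s, t : start and end positions of the span for this node.
    W, b : parameters of an assumed single-layer ReLU MLP mapping R^{4h} -> R^d.
    """
    span = np.concatenate([H[s], H[t]])   # endpoint states [s; t]
    return np.maximum(0.0, span @ W + b)  # MLP(...) -> fixed-size node vector
```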

Graph Reasoning and Prediction
After generating the dense representations of the graph nodes, GraSSLM leverages a two-layer graph convolutional network (GCN) (Kipf and Welling, 2016) to perform message passing over the graph. At each layer, the GCN aggregates the representations of each node's neighbors and further transforms the aggregated representation with a linear projection. Let L^(0) = {Q, C, {h_j}, {c_ij}} ∈ R^{K×d}, where K = 2 + M × N is the total number of nodes in the graph, including the query node, caption node, header nodes and cell nodes, and d is the output dimensionality of the MLP in Section 3.2. The graph reasoning process is formalized as

L^(l) = σ(D̃^{-1/2} Ã D̃^{-1/2} L^(l-1) W^(l-1)),

where L^(l) denotes the l-th (l = 1, 2) layer of the GCN, σ is the non-linear activation function (ReLU in our case), and W^(l-1) is the weight matrix of the (l-1)-th layer. D ∈ R^{K×K} denotes the graph degree matrix, which records the number of edges of each node, and A ∈ R^{K×K} denotes the graph adjacency matrix, which records the edge information. The symbol ~ indicates the renormalization trick of adding a self-connection to each node of the graph and building the corresponding degree and adjacency matrices. After two rounds of convolution, L^(2) holds the updated node features. The graph prediction is derived by a mean pooling operation over the nodes of the graph, followed by an MLP, that is, y = MLP(Pooling(L^(2))), where y is the predicted QA match score for the input query Q and data example S.
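The two-layer reasoning step with the renormalization trick can be sketched in a few lines; biases, dropout and the trained weights are omitted here for brevity.

```python
import numpy as np

def gcn_forward(A, L0, weights):
    """Two-layer GCN with the renormalization trick (Kipf & Welling, 2016).

    A : (K, K) adjacency matrix of the graph.
    L0 : (K, d) initial node features.
    weights : list of two weight matrices, one per GCN layer.
    """
    A_tilde = A + np.eye(A.shape[0])            # add self-connections
    d_tilde = A_tilde.sum(axis=1)               # renormalized degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # D~^{-1/2} A~ D~^{-1/2}
    L = L0
    for W in weights:                           # L^(l) = ReLU(A_hat L^(l-1) W)
        L = np.maximum(0.0, A_hat @ L @ W)
    return L

def predict(L, mlp):
    """Mean-pool the node features, then score with a caller-supplied MLP."""
    return mlp(L.mean(axis=0))
```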

Pre-training Strategy
Pre-training (Erhan et al., 2010) has become a new paradigm in natural language processing, and various pre-training techniques have been proposed (Devlin et al., 2018; Joshi et al., 2020). However, most previous pre-training techniques were designed for plain text; due to the structural characteristics of Web semi-structured data, they cannot be applied directly to such data. In this paper, we propose a novel pre-training method that allows the model to learn representations from the semantics embedded in both the text and the structures of tables and lists. Following the successful pre-training experience of transformer-based models (Devlin et al., 2018), we use two pre-training objectives designed specifically for semi-structured data.
Whole Cell Masking (WCM). We follow the masked language model proposed by BERT (Devlin et al., 2018), but with a different masking scheme. Extending whole word masking (Joshi et al., 2020), Whole Cell Masking first masks every token of a word if any of its pieces is masked, and additionally masks the whole cell content if any token in a table cell or header is masked. We mask 15% of all cells in total, replacing 80% of the masked cell tokens with a special mask token [MASK], 10% with random tokens, and 10% with the original tokens. Given input G = (Q, C, {h_j}, {c_ij}), let T = (t_1, . . . , t_|T|) be the sequence of tokens of G, where t_m ∈ T is the m-th token. For a masked token t_m, the prediction is t̂_m = MLP(e_m), where e_m ∈ R^d denotes the token-level embedding of t_m generated by the contextual language model. After an MLP with one hidden layer, e_m is decoded into a token prediction score t̂_m ∈ R^V, where V denotes the vocabulary size.
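A sketch of the cell-level masking step follows. For simplicity the 80/10/10 decision is drawn once per cell rather than per token, which is an assumption of this sketch; the original may apply the split at token level as in BERT.

```python
import random

MASK = "[MASK]"

def whole_cell_mask(cells, vocab, rng, mask_frac=0.15):
    """Whole Cell Masking sketch: `cells` is a list of cells, each a token list.

    Selects ~15% of cells; a selected cell is fully replaced by [MASK] tokens
    (80%), random vocabulary tokens (10%), or kept unchanged (10%). Returns
    the masked cells and, per cell, the original tokens to predict (or None).
    """
    masked, labels = [], []
    for cell in cells:
        if rng.random() < mask_frac:
            labels.append(list(cell))                  # predict the whole cell
            r = rng.random()
            if r < 0.8:
                masked.append([MASK] * len(cell))      # 80%: mask every token
            elif r < 0.9:
                masked.append([rng.choice(vocab) for _ in cell])  # 10%: random
            else:
                masked.append(list(cell))              # 10%: keep original
        else:
            masked.append(list(cell))
            labels.append(None)
    return masked, labels
```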
Neighbor Prediction Objective (NPO). To incorporate the structural information of semi-structured data in the pre-training stage, we propose a novel neighbor prediction objective for graph pre-training. The task is to predict each token inside a masked node using the representations of its neighbor nodes. To keep pre-training consistent with the fine-tuning stage, we apply the same contextual embedding module and graph reasoning module as used in fine-tuning to generate the reasoned node representations.
Formally, denote by L_n the representation of the n-th node N_n after contextual embedding and graph reasoning, and by the function neighbour(·) the representations of its neighbor nodes in the constructed graph. We use fixed sinusoidal embeddings as positional embeddings to predict the tokens from L^(2)_n:

r^n_k = [Pooling(neighbour(L_n)); p_k],    t̂^n_k = MLP(r^n_k),

where the function Pooling(·) converts the neighbor node representations of L_n into a d-dimensional vector with mean pooling. We concatenate the pooled representation of the neighbors and the k-th positional embedding p_k ∈ R^d to obtain the representation r^n_k for the k-th token. After an MLP with two hidden layers for decoding, we obtain the prediction t̂^n_k ∈ R^V for the k-th token.
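The construction of r^n_k can be sketched as follows, assuming the standard sinusoidal formulation (Vaswani et al., 2017) for p_k; the two-hidden-layer decoding MLP is left to the caller.

```python
import numpy as np

def sinusoidal_embedding(k, d):
    """Fixed sinusoidal positional embedding p_k in R^d (d assumed even)."""
    p = np.zeros(d)
    i = np.arange(0, d, 2)
    p[0::2] = np.sin(k / 10000 ** (i / d))
    p[1::2] = np.cos(k / 10000 ** (i / d))
    return p

def npo_token_input(neighbor_reps, k, d):
    """Build r^n_k = [Pooling(neighbour(L_n)); p_k].

    neighbor_reps : (num_neighbors, d) reasoned representations of the
    neighbor nodes of node n. Mean pooling gives a d-dim vector, which is
    concatenated with the k-th positional embedding.
    """
    pooled = np.mean(neighbor_reps, axis=0)   # Pooling(neighbour(L_n))
    return np.concatenate([pooled, sinusoidal_embedding(k, d)])
```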
GraSSLM sums the losses of the whole cell masking objective and the neighbor prediction objective as the total loss. For the m-th token in the input token sequence, we can find the corresponding position k in the n-th node. The total pre-training loss is L = L_WCM + L_NPO. Notably, NPO directly uses the masked input from WCM for graph construction and prediction.

Experiments
We evaluate the GraSSLM model and other baselines on three datasets, including one table QA dataset, one list QA dataset and one small dataset of complex query-table pairs. In addition, we leverage two other large-scale datasets for pre-training. We describe the three datasets as follows.
• Table Query Matching dataset (Table-QM) is an English table QA dataset from a commercial Q&A system. Each case is required to be labeled by three judges; cases with 2/3 or higher positive labels receive positive final labels, otherwise negative.
• List Query Matching dataset (List-QM) is an English list QA dataset from a commercial Q&A system, with about 62k labeled cases. The data collection process is similar to that of Table-QM. For query selection, we include unordered lists, ordered lists and description lists (Consortium and others, 1999) to increase diversity. For pre-training of GraSSLM, we leverage the following semi-structured datasets.

• Deep Tables Query Matching dataset (DTable-QM) is a subset of the Table-QM dataset that contains complex query-table pairs.
• Large-scale table and list corpora mined from the Web are leveraged for the two pre-training objectives.

We compare GraSSLM with several strong baselines. Single-field document retrieval (SDR) (Cafarella et al., 2008; Cafarella et al., 2009) and multi-field document retrieval (MDR) (Pimplikar and Sarawagi, 2012) are two representative methods that treat a semi-structured data example as a single document or a multi-field document and apply an IR approach to QA (Zhang and Balog, 2018). Semantic table retrieval (STR) (Zhang and Balog, 2018) introduces a semantic representation for Web tables, including sets of extracted concepts and entities. BERT (Devlin et al., 2018) is a powerful Transformer-based model that has demonstrated impressive performance on semantic matching tasks; we apply it to the concatenation of the query and the sequential tokens of a semi-structured data example, and then use a multi-layer perceptron for classification. None of these methods considers the structural information in tables or lists. In this paper, we use BERT-base as our backbone model and as a baseline. The last baseline is TAPAS (Herzig et al., 2020), a recent state-of-the-art approach to QA over tables, which encodes rows and columns to embed the structural information of tables.
To measure matching accuracy, we use average F1 as our metric. Precision, recall, and F1 are computed from the numbers of true positives (TP), false positives (FP), and false negatives (FN), with F1 being the harmonic mean of precision and recall. Since the matching prediction task is cast as a binary classification task, we compute the F1 score of each run and report the average.
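For concreteness, the metric can be computed as follows; averaging per-run F1 scores is our reading of "average F1" here and is noted as an assumption.

```python
def average_f1(runs):
    """Average the F1 scores of several runs, each given as (TP, FP, FN) counts."""
    def f1(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0.0:
            return 0.0
        # F1 is the harmonic mean of precision and recall.
        return 2 * precision * recall / (precision + recall)

    scores = [f1(*counts) for counts in runs]
    return sum(scores) / len(scores)
```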
All methods are implemented in PyTorch (Paszke et al., 2017) and trained on an Ubuntu 16.04 server with 64GB of memory and eight GTX 1080 Ti GPUs. For all datasets, we randomly select 80% of the records as the training set, 10% as the validation set, and the remaining 10% as the test set. We train the model on the training data and fix the model parameters based on the best performance on the validation set. We then evaluate the model on the test set. We perform three random runs and report both the mean and the standard deviation of the test performance.
We use stochastic gradient descent (SGD) with a learning rate of 2e-5 and mini-batches of size 64 (batch size 8 on each of the 8 GPUs). We use an MLP with one hidden layer of 768 units, and apply dropout with a rate of 0.5 to all feed-forward neural networks. For the pre-training process, we use a batch size of 64 and train for 4 epochs over the large-scale dataset on the two unsupervised tasks, with a learning rate of 2e-5 for each task. For node initialization, we apply a Bi-LSTM with a hidden layer of 768 units on top of the transformer output. The GCN contains two convolutional layers with a hidden size of 1,536. After node-level convolution, we adopt mean pooling for the graph representation. For the positional embedding, we create a fixed sinusoidal embedding with 768 units.
For all baseline models, we use the corresponding pre-trained transformer models for word embedding and use the output of the [CLS] token as the sentence embedding. Out-of-vocabulary (OOV) words are hashed to one of 100 random embeddings, each initialized with mean 0 and standard deviation 1. All other hidden-layer weights are initialized from a random Gaussian distribution with mean 0 and standard deviation 0.01. Each hyperparameter setting is run on the same machine as GraSSLM, using Adagrad for optimization with an initial accumulator value of 0.1.

Overall Performance
We compare GraSSLM against the state-of-the-art baselines on the Table-QM, List-QM and DTable-QM datasets. The results are reported in Table 2. As GraSSLM is complementary to language models, we use GraSSLM (BERT) to denote GraSSLM with BERT as the backbone language model. GraSSLM consistently achieves the best performance against all baselines. GraSSLM outperforms the baseline BERT by up to 5.44% (List-QM). Our model captures both text-level and structure-level information by explicitly modeling the inherent building components of Web semi-structured data and their semantic correlations. Compared to the best IR-based method, STR, our model is up to 7.68% better on the List-QM dataset. This demonstrates that the heterogeneous graph model in GraSSLM uses structural features more effectively than IR-based methods, which focus on slicing Web semi-structured data into different documents but ignore the potential correlations among them. In addition, GraSSLM outperforms TAPAS, the newest baseline for QA on tables, by up to 3.90% on the List-QM dataset, illustrating that the graph-based pre-training objectives strengthen the representation capability of models for semi-structured data, which we further discuss in Section 4.3.

Notably, all the baselines display severe performance drops on the DTable-QM dataset, while GraSSLM still achieves the best performance (71.19%). The explicit graph modeling guides the model to learn attention over noisy contexts, which benefits semantic reasoning on complicated tables.

Ablation Studies
We conduct ablation studies on GraSSLM to empirically examine the contribution of each component, particularly the semantic relations we propose. The studies include the following steps.

Semantic Relation Ablation. To study the contribution of the semantic relations defined in Section 2.2, we remove the edges representing the Caption-Content, Header-Cell, Subject-Attribute and Cell-Cell relations from the graphs, respectively, and keep the other components untouched.

LSTM Ablation. We replace the Bi-LSTM, which generates node representations from the output of the language model, with average pooling to obtain the fixed-size initial embeddings used as inputs to the graph neural network.

GCN Ablation. We remove the GCN, which aggregates and updates the node-level representations and produces the final prediction. We also remove the LSTM part, as there is no longer a need to generate node inputs; instead, we use an MLP for classification, which makes the model the same as our BERT baseline.

The results show that each ablation degrades performance to a different extent. Removing the GCN, which conducts explicit graph reasoning, causes a serious performance drop of 3.36% on average. This again confirms the effectiveness of explicitly modeling the inherent building components of Web semi-structured data and their semantic correlations, especially for lists (a decrease of 5.12%). The semantic relation ablations show the contribution of each relation to semantic fusion among components: the Cell-Cell relation contributes the most to semantic modeling, with the largest performance reduction (1.74% on average). GraSSLM w/o the Header-Cell and Subject-Attribute relations drops 1.33% and 0.57% on average, respectively, indicating that the GCN successfully utilizes these relations in table modeling.
Additionally, replacing the Bi-LSTM component reduces the overall performance the least (0.79% on average), showing that the Bi-LSTM is still better than simple pooling for token aggregation.

Pre-training Strategy Analysis
To evaluate the proposed pre-training techniques, we train the original GraSSLM model with different objectives. Specifically, we apply only one pre-training objective at a time and evaluate the performance on the three datasets. The evaluation results are shown in Table 4. Each objective contributes to the performance improvement. When we use WCM alone as the pre-training objective, the performance increases by up to 1.97% across the three datasets; WCM successfully guides the model to learn reasonable token-level embeddings. With NPO alone, the performance increases by up to 1.84%; NPO allows the inherent semantic correlations among neighbor nodes to be propagated via the structural graph connections. The combination of WCM and NPO achieves the largest performance increase (1.97%), showing that this pre-training strategy best exploits the structural information in tables and lists.

Related Work
Early studies on query-table matching adopt IR approaches. For example, Chakrabarti et al. (2020) and Pimplikar and Sarawagi (2012) convert Web tables into multi-field documents and apply document retrieval pipelines (Jurafsky and Martin, 2006; Paşca, 2003). Zhang and Balog (2018) propose to create semantic features at the text, concept and entity levels. These methods mainly consider the textual information in tables but largely ignore their inherent structural information.
In recent years, learning representations for semi-structured data has received increasing interest. Nishida et al. (2017) propose to apply convolutional models to Web tables. The rationale is to consider a Web table as a matrix of text, analogous to an image of pixels. However, their model does not show strong performance, partly because the semantic relationships among neighboring cells in tables may be far more complex than the simple adjacency relation among neighboring pixels in an image. Herzig et al. (2020) propose TAPAS, a weakly supervised table parsing method. TAPAS models the structural information of tables by explicitly encoding rows and columns. Similarly, Yin et al. (2020) propose TABERT, which focuses on pre-training methods for the table QA task; the authors design a pipeline for learning row-level and column-level representations. Müller et al. (2019) also build a graph representation over table cells, focusing on optimizing cell answer selection. However, these works only model the row/column relations among table cells, without considering other relations such as the caption-content, header-cell and subject-attribute relations. In this work, we give a thorough categorization of the relations among all components in semi-structured data and propose a graph model that incorporates all these relations.
Our work is also related to the broad areas of graph neural networks and pre-training techniques; we refer interested readers to Wu et al. (2020) and Qiu et al. (2020) for comprehensive surveys.

Conclusion
Semi-structured data on the Web, including tables and lists, present a rich source for Web QA. Most of the previous methods do not take full advantage of structural information in semi-structured data. In this paper, we propose a novel approach to model both textual and structural information in semi-structured data. Extensive experimental results verify the effectiveness of our approach.