Answering Conversational Questions on Structured Data without Logical Forms

We present a novel approach to answering sequential questions based on structured objects such as knowledge bases or tables without using a logical form as an intermediate representation. We encode tables as graphs using a graph neural network model based on the Transformer architecture. The answers are then selected from the encoded graph using a pointer network. This model is appropriate for processing conversations around structured data, where the attention mechanism that selects the answers to a question can also be used to resolve conversational references. We demonstrate the validity of this approach with competitive results on the Sequential Question Answering (SQA) task.


Introduction
In recent years, there has been significant progress on conversational question answering (QA), where questions can be meaningfully answered only within the context of a conversation (Iyyer et al., 2017; Choi et al., 2018; Saha et al., 2018). This line of work, as in the single-turn QA setup, falls into two main categories: (i) the answers are extracted from some text in a reading comprehension setting, or (ii) the answers are extracted from structured objects, such as knowledge bases or tables. The latter is commonly posed as a semantic parsing task, where the goal is to map questions to some logical form which is then executed over the knowledge base to extract the answers.
In semantic parsing, there is extensive work on using deep neural networks for training models over manually created logical forms in a supervised learning setup (Jia and Ling et al., 2016; Dong and Lapata, 2018). However, creating labeled data for this task can be expensive and time-consuming. This problem has motivated research on semantic parsing with weak supervision, where the training data consists of questions and answers along with the structured resources, and the model must recover the logical form representations that would yield the correct answer (Iyyer et al., 2017).
In this paper, we follow this line of research and investigate answering sequential questions with respect to structured objects. In contrast to previous approaches, instead of learning the intermediate logical forms, we propose a novel approach that encodes the structured resources, i.e. tables, along with the questions and answers from the context of the conversation. This approach allows us to handle conversational context without the definition of detailed operations or a vocabulary dependent on the logical form formalism that are required in the weakly supervised semantic parsing approaches.
We evaluate the proposed approach on the Sequential Question Answering (SQA) task (Iyyer et al., 2017), where it improves the state-of-the-art overall question accuracy, particularly on the follow-up questions that require effective encoding of the context.

Approach
We build a QA model for a sequence of questions that are asked about a table and can be answered by selecting one or more table cells.

Graph Formulation
We encode tables as graphs by representing columns, rows and cells as nodes. Figure 1 shows an example graph representing how we encode the table in relation to $q_2$, which is the follow-up question to $q_1$. Within a column, cells with identical texts are collapsed into a single node. In the example graph, we only create a single node for "Toronto" and a single node for "Montreal". We then add four types of directed edges that connect columns and rows to cells, one in each direction (orange and green edges in the figure). The question is represented by a node covering the entire question text and a node for each token. The main question node is connected to each token, column and cell node.¹ Nodes have associated nominal feature sets. All nodes have a feature indicating their type: column, row, cell, question and question token. The text in column (i.e., the column name), cell, question and token nodes is added to the corresponding node feature set using a bag-of-words representation. Column, row and cell nodes have additional features that indicate their column (for cell and column nodes) and row (for cell and row nodes) indexes.
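The table part of this graph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the feature strings and the edge label names are hypothetical, the question nodes are omitted, and a collapsed cell node is assumed to carry the row index of every row it occurs in. The third table row and all floor counts are made up for the example.

```python
def bag_of_words(text):
    """Bag-of-words features for a piece of text (hypothetical feature naming)."""
    return {f"word={t}" for t in text.lower().split()}


def build_table_graph(columns, rows):
    """Return (nodes, edges) for a table.

    nodes maps a node id to its nominal feature set; edges is a set of
    (source, target, label) triples. Question nodes are left out of this sketch.
    """
    nodes, edges = {}, set()
    for c, name in enumerate(columns):
        nodes[("column", c)] = {"type=column", f"col={c}"} | bag_of_words(name)
    for r, row in enumerate(rows):
        nodes[("row", r)] = {"type=row", f"row={r}"}
        for c, text in enumerate(row):
            # Cells with identical text in the same column share one node.
            cell = ("cell", c, text.lower())
            if cell not in nodes:
                nodes[cell] = {"type=cell", f"col={c}"} | bag_of_words(text)
            # Assumption: a collapsed cell keeps one row feature per occurrence.
            nodes[cell].add(f"row={r}")
            # Four edge types: column<->cell and row<->cell, one per direction.
            edges.add((("column", c), cell, "column_to_cell"))
            edges.add((cell, ("column", c), "cell_to_column"))
            edges.add((("row", r), cell, "row_to_cell"))
            edges.add((cell, ("row", r), "cell_to_row"))
    return nodes, edges


# Illustrative table; the floor counts and the third row are invented.
columns = ["Building", "City", "Floors"]
rows = [
    ["First Canadian Place", "Toronto", "72"],
    ["Commerce Court West", "Toronto", "57"],
    ["Place Ville Marie", "Montreal", "47"],
]
nodes, edges = build_table_graph(columns, rows)
```

In this example the two "Toronto" cells end up as a single node that is linked to both of its rows, which is the collapsing behaviour described in the text.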
We align question tokens with column names and cell text using the Levenshtein edit distance between n-grams in the question and the table text, similar to previous work (Shaw et al., 2019). In particular, we score every question n-gram with the normalized edit distance² and connect the cell to the token span if the score is > 0.5. Through the alignment, the cell is connected to all the tokens of the matching span and the binned score is added as an additional feature to the cell. In Figure 1, the "building" and "floors" tokens in the questions are connected to the matching "Building" and "Floors" column nodes from the table.
¹ We do not show some of these connections in the figure to avoid clutter.
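The n-gram scoring used for the alignment can be sketched as follows. This is an illustrative sketch, not the authors' code: `edit_distance`, `ned`, `ngrams` and `score_spans` are hypothetical names, the maximum n-gram length `max_n` is an assumed parameter, and the distance is computed over characters. The sketch only returns the normalized edit distance per span; how the score is thresholded and binned is left as described in the text.

```python
def edit_distance(v, w):
    """Levenshtein distance between strings v and w (row-by-row dynamic programming)."""
    prev = list(range(len(w) + 1))
    for i, cv in enumerate(v, start=1):
        curr = [i]
        for j, cw in enumerate(w, start=1):
            cost = 0 if cv == cw else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ned(v, w):
    """Normalized edit distance: ed(v, w) / max(|v|, |w|)."""
    longest = max(len(v), len(w))
    return edit_distance(v, w) / longest if longest else 0.0


def ngrams(tokens, max_n=3):
    """Yield (start, end, text) for every token span of up to max_n tokens."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield start, start + n, " ".join(tokens[start:start + n])


def score_spans(question_tokens, table_text, max_n=3):
    """Score every question n-gram against one column name or cell text."""
    target = table_text.lower()
    return [(start, end, ned(span, target))
            for start, end, span in ngrams(question_tokens, max_n)]


tokens = "which buildings have more than 60 floors".split()
closest = min(score_spans(tokens, "Floors"), key=lambda s: s[2])
```

Here `closest` identifies the single-token span "floors" as the n-gram nearest to the "Floors" column name.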
² $\mathrm{ned}(v, w) = \frac{\mathrm{ed}(v, w)}{\max(|v|, |w|)}$
Numeric Operations In order to allow operations over numbers and date expressions, we extend our graph with a set of relations involving numerical values in the question and table cells.
We infer the type of the numerical values in a column, such as the ones in the "Floors" column, by picking their most frequent type (number or date). Then, we add special features to each cell in the column: the rank and inverse rank of the value in the cell, considering the other cell values from the same column. These features allow the model to answer questions such as "what is the building with most floors?". In addition, we add a new node to the graph for each numeric expression from the question (such as the number 60 from the second question in Figure 1), and we connect it to the tokens it spans. The numeric nodes originating from the question are connected to the table cells containing numerical values. The connection type encodes the result of the comparison between the question and cell values (lesser, greater or equal), as shown in the figure (yellow edges). These relations allow the model to answer questions such as "which buildings have more than 50 floors?".
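The rank features and comparison relations can be illustrated with a short sketch. The names `rank_features` and `compare` are hypothetical, and several details are assumptions rather than facts from the text: rank 1 is given to the smallest value, equal values share a rank, and the relation label describes the cell value relative to the question value. The floor counts are invented for the example.

```python
def rank_features(values):
    """Return (rank, inverse_rank) for each value within its column.

    Assumptions: rank 1 is the smallest value and ties share a rank.
    """
    ordered = sorted(set(values))
    rank = {v: i + 1 for i, v in enumerate(ordered)}
    return [(rank[v], len(ordered) - rank[v] + 1) for v in values]


def compare(question_value, cell_value):
    """Edge label between a question number and a numeric cell.

    Assumption: the label states how the cell compares to the question value.
    """
    if cell_value < question_value:
        return "lesser"
    if cell_value > question_value:
        return "greater"
    return "equal"


floors = [72, 57, 47]  # illustrative values for a "Floors" column
ranks = rank_features(floors)
relations = [compare(60, value) for value in floors]
```

With these features, a superlative question can be answered by attending to the cell with inverse rank 1, and a question such as "more than 60 floors" by following the "greater" edges from the question's numeric node.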
Context We extend the model to capture conversational context by using the feature-based encoding in the graph formulation. In order to handle follow-up questions, we add the previous answers to the graph by marking all the answer rows, columns and cells with nominal features. The nodes with borders in Figure 1 contain the answers to the first question $q_1$: "what are the buildings in toronto?". In the example, the first two rows receive a feature ANSWER ROW, the "building" column a feature ANSWER COLUMN and "First Canadian Place" and "Commerce Court West" a feature ANSWER CELL. Notice that the content of $q_1$ is not encoded in the graph, only its answers.
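Marking the previous answers amounts to adding three nominal features. A minimal sketch follows, assuming a hypothetical `features` dictionary keyed by row, column and cell coordinates and feature names written with underscores; neither is taken from the authors' implementation.

```python
def mark_previous_answers(features, answer_cells):
    """Add answer features for the (row, column) coordinates of the previous answers."""
    for row, col in answer_cells:
        features[("row", row)].add("ANSWER_ROW")
        features[("column", col)].add("ANSWER_COLUMN")
        features[("cell", row, col)].add("ANSWER_CELL")
    return features


# A 3x3 table where every node starts with only its type feature.
features = {("row", r): {"type=row"} for r in range(3)}
features.update({("column", c): {"type=column"} for c in range(3)})
features.update({("cell", r, c): {"type=cell"} for r in range(3) for c in range(3)})

# The previous question was answered by the first two cells of column 0.
mark_previous_answers(features, [(0, 0), (1, 0)])
```

Only the answers are written into the graph, so the representation of the follow-up question does not grow with the length of the conversation.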

Node Representations
Before the initial encoder layer, all nodes are mapped to vector representations using learned embeddings. For nodes with multiple features, such as column and cell nodes, we reduce the set of feature embeddings to a single vector using the mean. We also concatenate an embedding encoding whether the node represents a question token, or not.
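The reduction of a feature set to a single node vector can be sketched as below. This is a toy version under stated assumptions: `NodeEmbedder` is a hypothetical class, the embeddings are random stand-ins for learned parameters, and the dimensions are arbitrary.

```python
import random


class NodeEmbedder:
    """Maps a node's feature set to one vector (random stand-in for learned embeddings)."""

    def __init__(self, dim=4, flag_dim=2, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}  # feature -> embedding, created on first use
        # Separate embedding that encodes whether the node is a question token.
        self.flag = {b: [self.rng.uniform(-1, 1) for _ in range(flag_dim)]
                     for b in (False, True)}

    def _lookup(self, feature):
        if feature not in self.table:
            self.table[feature] = [self.rng.uniform(-1, 1) for _ in range(self.dim)]
        return self.table[feature]

    def embed(self, features, is_question_token):
        vectors = [self._lookup(f) for f in sorted(features)]
        # Mean over the feature embeddings, dimension by dimension.
        mean = [sum(column) / len(vectors) for column in zip(*vectors)]
        # Concatenate the question-token indicator embedding.
        return mean + self.flag[is_question_token]


embedder = NodeEmbedder()
vector = embedder.embed({"type=cell", "word=toronto"}, is_question_token=False)
```

Averaging keeps the node vector size independent of how many features a node carries, which matters for cells with long texts.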

Encoder
We use a Graph Neural Network (GNN) encoder based on the Transformer (Vaswani et al., 2017). The only modification is that the self-attention mechanism is extended to consider the edge label between each pair of nodes. We follow the formulation of Shaw et al. (2019) that uses additive edge vector representations. The self-attention weights are therefore calculated as:
$$s_{ij} = (W^q x_i)^\top (W^k x_j + r_{ij})$$
where $s_{ij}$ is the unnormalized attention weight for the node vector representations $x_i$ and $x_j$, and $W^q$ and $W^k$ are parameter matrices. This extension introduces $r_{ij}$ to the calculation, which is a vector representation of the edge label between the two nodes. Edge vectors are similarly added when summing over node representations to produce the new output representations. We use edge labels corresponding to relative positions between tokens, alignments between tokens and table elements, and relations between table elements, as described in Section 2.1. These amount to 9 fixed edge labels in the graph (4 between rows/cells/columns, 2 between question and cells/columns, and 3 numeric relations) and a tunable number of relative token positions.
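A single attention head with additive edge vectors can be sketched in plain Python. This is an illustration under assumptions, not the authors' encoder: the function and argument names are hypothetical, the value projection `Wv` and the separate key and value edge tables follow the general scheme of additive edge representations, and the usual scaling of the scores, multiple heads and layer normalization are omitted for brevity.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def edge_aware_attention(X, Wq, Wk, Wv, edge_k, edge_v, labels):
    """One attention head where the label of edge (i, j) shifts keys and values.

    X: node vectors; labels[i][j]: edge label id between nodes i and j;
    edge_k / edge_v: label id -> edge vector added to keys / values.
    """
    outputs = []
    for i, xi in enumerate(X):
        q = matvec(Wq, xi)
        scores, values = [], []
        for j, xj in enumerate(X):
            r_k = edge_k[labels[i][j]]
            r_v = edge_v[labels[i][j]]
            key = [a + b for a, b in zip(matvec(Wk, xj), r_k)]
            value = [a + b for a, b in zip(matvec(Wv, xj), r_v)]
            scores.append(dot(q, key))  # unnormalized weight for the pair (i, j)
            values.append(value)
        alpha = softmax(scores)
        outputs.append([sum(a * v[d] for a, v in zip(alpha, values))
                        for d in range(len(values[0]))])
    return outputs
```

Because the edge vector enters the key, two nodes with identical content can still receive different attention depending on how they are connected to the attending node.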

Answer Selection
We extend the Transformer decoder to include a copy mechanism based on a pointer network (Vinyals et al., 2015). The copy mechanism allows the model to predict sequences of answer columns and rows from the input, rather than select symbols from an output vocabulary. Figure 2 visualizes the entire model architecture.
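One decoding step of such a copy mechanism can be sketched as follows. The sketch assumes dot-product scoring between the decoder state and the encoded nodes, which is a simplification chosen here and not necessarily the scoring function of the model; `pointer_distribution`, `copy_step` and the node identifiers are hypothetical.

```python
import math


def pointer_distribution(decoder_state, encoded_nodes):
    """Softmax over input nodes, scored by dot product with the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, node))
              for node in encoded_nodes]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def copy_step(decoder_state, encoded_nodes, node_ids):
    """Pick the input node with the highest pointer probability."""
    probs = pointer_distribution(decoder_state, encoded_nodes)
    best = max(range(len(probs)), key=probs.__getitem__)
    return node_ids[best], probs


encoded = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ids = ["column:0", "row:0", "row:1"]
selected, probs = copy_step([0.0, 2.0], encoded, ids)
```

The output space is the set of input nodes itself, so the decoder needs no vocabulary of table-specific symbols.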

Related Work
Semantic parsing models can be trained to produce gold logical forms using an encoder-decoder approach (Suhr et al., 2018) or by filling templates (Xu et al., 2017;Peng et al., 2017;Yu et al., 2018). When gold logical forms are not available, they are typically treated as latent variables or hidden states and the answers or denotations are used to search for correct logical forms (Yih et al., 2015;Long et al., 2016;Iyyer et al., 2017). In some cases, feedback from query execution is used as a reward signal for updating the model through reinforcement learning (Zhong et al., 2017;Agarwal et al., 2019) or for refining parts of the query (Wang et al., 2018). In our work, we do not use logical forms or RL, which can be hard to train, but simplify the training process by directly matching questions to table cells.
Most of the QA and semantic parsing research focuses on single turn questions. We are interested in handling multiple turns and therefore in modeling context. In semantic parsing tasks, logical forms (Iyyer et al., 2017;Sun et al., 2018b;Guo et al., 2018) or SQL statements (Suhr et al., 2018) from previous questions are refined to handle follow up questions. In our model, we encode answers to previous questions by marking answer rows, columns and cells in the table, in a nonautoregressive fashion.
With regard to how structured data is represented, methods range from encoding table information, metadata and/or content (Gur et al., 2018; Sun et al., 2018b; Petrovski et al., 2018) to encoding relations between the question and table items (Krishnamurthy et al., 2017) or KB entities (Sun et al., 2018a). We also encode the table structure and the question in an annotation graph, but use a different modelling approach.

Experimental Setup
We evaluate our method on the SequentialQA (SQA) dataset (Iyyer et al., 2017), which consists of sequences of questions that can be answered from a given table. The dataset is designed so that every question can be answered by one or more table cells. It consists of 6,066 answer sequences containing 17,553 questions (2.9 questions per sequence on average). Table 2 shows an example.
We lowercase and remove punctuation from questions, cell texts and column names. We then split each input utterance on spaces to generate a sequence of tokens.³ We only keep the most frequent 5,000 word types and map everything else to one of 2,000 OOV buckets. Numbers and dates are parsed in a similar way as in DYNSP and the Neural Programmer (Neelakantan et al., 2016a).
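The preprocessing steps above can be sketched as follows. The function names are hypothetical, only ASCII punctuation is stripped, and assigning unknown words to buckets with a CRC32 hash is an assumption made for this example; the text does not say how the buckets are assigned.

```python
import string
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase, remove ASCII punctuation and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def build_vocab(token_lists, size=5000):
    """Keep the `size` most frequent word types."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token: i for i, (token, _) in enumerate(counts.most_common(size))}


def token_id(token, vocab, size=5000, oov_buckets=2000):
    """Return the vocabulary id, or one of the OOV bucket ids placed after it."""
    if token in vocab:
        return vocab[token]
    # Deterministic hash, so an unknown word always lands in the same bucket.
    return size + zlib.crc32(token.encode("utf-8")) % oov_buckets


question_tokens = tokenize("What are the buildings in Toronto?")
```

Hashing into a fixed number of buckets lets unseen words that differ from each other still receive distinct embeddings most of the time.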
We use the Adam optimizer (Kingma and Ba, 2014) for optimization and tune hyperparameters with Google Vizier (Golovin et al., 2017). More details and the hyperparameter ranges can be found in the appendix (A).
All numbers given for our model are averaged over 5 independent runs with different random initializations.
We observe that our model improves the state of the art (SOTA) from 45.6% by CAMP to 55.1% in question accuracy (ALL), an 18% relative error reduction. For the initial question (POS1), however, it is behind DYNSP by 3.7%. More interestingly, our model handles follow-up questions especially well, outperforming the previously best model FP by 20% on POS3, a 28% relative error reduction.
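The relative error reduction on ALL follows directly from the two accuracies quoted above. A short worked computation, with a hypothetical helper name:

```python
def relative_error_reduction(old_accuracy, new_accuracy):
    """Percentage of the old error (100 - accuracy) removed by the new model."""
    old_error = 100.0 - old_accuracy
    new_error = 100.0 - new_accuracy
    return 100.0 * (old_error - new_error) / old_error


# CAMP: 45.6% accuracy (54.4% error); this model: 55.1% accuracy (44.9% error).
reduction_all = relative_error_reduction(45.6, 55.1)  # about 17.5, reported as 18%
```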
³ Whitespace tokenization simplifies the preprocessing, but we can expect an off-the-shelf tokenizer to work as well or even better.
As in previous work, we also report performance for a non-contextual setup where follow-up questions are answered in isolation. We observe that our model effectively leverages the context information, improving the average question accuracy from 45.1% to 55.1%, whereas the use of context in DYNSP yields a 2.7% improvement. If we provide the previous reference answers, the average question accuracy jumps to 61.7%, showing that error propagation accounts for 6.6 points of accuracy.
Numeric operations To understand how effective our model is in handling numeric operations, we trained models without the specific handling explained in Section 2. We find that the overall accuracy decreases from 55.1% to 51.5%, demonstrating the competence of our approach to model such operations. This effectiveness is further emphasized when focusing on questions that contain a superlative (e.g., "tallest", "most expensive"), where accuracy is 47.3% with numeric relations and 40.3% without. It is worthwhile to call out that the model without special number handling still outperforms the previous SOTA CAMP by more than 5 points (45.6% vs 51.5%).
1  Australia      2  1  0  3
2  Italy          1  1  1  3
3  Germany        1  0  1  2
4  Soviet Union   1  0  0  1
5  Switzerland    0  2  1  3
6  United States  0  1  0  1
7  Great Britain  0  0  1  1
7  France         0  0  1  1

Error analysis. Table 2 shows an example that is consistently handled correctly by the model. It requires a simple string match ("nations" → "nation"), and implicit and explicit comparisons. We performed error analysis on test data over 100 initial (POS1) and 100 follow-up questions (POS>1) to identify the limitations of our approach.
For the initial questions, we find that 26% are match errors, e.g., the model does not match "episode" to "Eps #", or cases where the model has to exclude rows with empty values from the results. 29% of the errors require a more sophisticated table understanding, e.g., rows that represent the total of all other rows should often not be included in the answers. For 15% of the errors, we think that the reference answer is incorrect and for another 15% the model prediction is correct but contains duplicates because multiple rows contain the same value. 12% of the errors are around complex matches such as selecting certain ranks ("the first two"), exclusion or negation.
For the follow up questions, 38% are caused by complex matches; 17% are match errors; 13% of the errors are due to incorrect reference answers and 11% would require advanced table understanding. Only 8% of the errors are due to incorrect management of the conversational context. Section B of the appendix contains a more detailed analysis and error examples.

Discussion
We present a model for table-centered conversational QA that predicts the answers directly from the table. We show that this model improves the SOTA on the SQA dataset and handles conversational context particularly effectively.
As future work, we plan to expand our model with pre-trained language representations (e.g., BERT (Devlin et al., 2018)) in order to improve performance on initial queries and matching of queries to table entries. To handle larger tables, we will investigate sharding the table row-wise, running the model on all the shards first, and then on the final table which combines all the answer rows.
Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, and Jian Yin. 2018. Dialog-to-action
For the encoder and decoder, we select the number of layers from [3, 6] and embedding and hidden dimensions from {128, 256, 512}, setting the feed forward layer hidden dimensions 4× higher. We employ dropout at training time with $P_{dropout}$ selected from {0.2, 0.4, 0.5}. We select the number of attention heads from {4, 8, 16} and use a clipping distance of 6 for relative position representations.
We use the Adam optimizer (Kingma and Ba, 2014) with $\beta_1 = 0.9$, $\beta_2 = 0.98$, and $\epsilon = 10^{-9}$. We tune the learning rate and use the same warmup and decay strategy for the learning rate as Vaswani et al. (2017), selecting a number of warm-up steps up to a maximum of 2000. We run the training until convergence for a fixed number of steps (100,000) and use the final checkpoint for evaluation. We choose batch sizes from {32, 64}.
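The warmup and decay strategy of Vaswani et al. (2017) increases the learning rate linearly during warm-up and then decays it with the inverse square root of the step. A minimal sketch follows; the function name, the default `d_model` and the `scale` multiplier (standing in for the tuned learning rate) are assumptions of this example.

```python
def transformer_lr(step, d_model=256, warmup_steps=2000, scale=1.0):
    """Linear warm-up followed by inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first update
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


peak = transformer_lr(2000)        # the two terms of min() meet at the end of warm-up
final = transformer_lr(100_000)    # rate at the last of the 100,000 training steps
```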

B Results
Table Size Given that our model makes use of the whole table, it is conceivable that the performance of our approach can be more sensitive to the table size than methods that predict intermediate representations. Plotting the model performance with respect to number of cells in the table (Figure 3), we observe that the performance does not vary significantly by the table size.
Error Analysis For a detailed analysis, we annotated 100 initial and 100 follow-up questions with the following match types:
MATCH A lexical or semantic match error such as not matching "episode" with "EPS #".
COMPLEX MATCH A question that would require a numerical value match, some sort of sorting or a negation to be answered. Example: "what is the area of all of the non-remainders?"
GOLD A question with a wrong reference answer. Example: "what gene functions are listed?" (gold points to the Category column rather than Gene Functions).
ANSWER SET The returned answer should be duplicate free. "what are all of the rider's names?", but the table contains "Carl Fogarty" multiple times.
CONTEXT Only used in follow-up questions. This error indicates that a more sophisticated context management is needed.
OTHER Any other kind of error.
Table 3 contains the error counts for initial questions and follow-ups, respectively.
"and which has been active for the longest?": Reasoning with text and date (1986-present).
"who else is in that field?": Exclusion.
"of these, which did not publish on february 9?": Negation. The model is doing the right thing but missing one of the values.
"what is the highest passengers for a route?": The model selects the 2nd highest and not the 1st.
TABLE UNDERSTANDING
"now, during which year did they have the worst placement?": Requires understanding that "Withdrawal .." is worse than any position.
"which of these seasons have four judges?": Requires counting the number of named entities within a single cell.
GOLD
"what gene functions are listed?": Gold points to "Category" column rather than "Gene Functions".
"which aircraft have 1 year of service": Gold points to the 4th column ("in service") instead of the 1st column ("aircraft").
CONTEXT
"when was thaddeus bell born?, when was klaus jurgen schneider born?, which is older?": The correct birthday is selected, but not the person.