Does Structure Matter? Encoding Documents for Machine Reading Comprehension

Machine reading comprehension is a challenging task, especially for querying documents with deep and interconnected contexts. Transformer-based methods have shown strong performance on this task; however, most of them still treat documents as flat sequences of tokens. This work proposes a new Transformer-based method that reads a document as tree slices. It contains two modules for identifying the most relevant text passages and the best answer span respectively, which are not only jointly trained but also jointly consulted at inference time. Our evaluation results show that the proposed method outperforms several competitive baseline approaches on two datasets from varied domains.


Introduction
Machine Reading Comprehension (MRC) is the task of reading a given text and answering questions about it (Liu et al., 2019). Some MRC tasks such as SQuAD (Rajpurkar et al., 2016, 2018) and ShARC (Saeidi et al., 2018) provide short text snippets as the context documents, while others such as TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019) and Doc2Dial (Feng et al., 2020) use full articles as documents. Most top-performing models on MRC tasks use variants of Transformers (Vaswani et al., 2017). Due to the constraint on input sequence length, Transformer-based models typically only consider a certain number of tokens, utilize a sliding window approach (Richardson et al., 2013), or segment the document into passages (Hu et al., 2019). More recent works explore how to scale up the input length (Beltagy et al., 2020; Kitaev et al., 2020; Ainslie et al., 2020) but still mainly focus on flat sequences. In addition to scaling up input length, ETC (Ainslie et al., 2020) also proposes to encode structured inputs

Figure 1: A sample document segment with the hierarchical structure (left), the partial tree slices (right) and sample dialogue turns (bottom right).
based on relative position encoding (Shaw et al., 2018) through a global-local attention mechanism. A series of recent works explores incorporating structured knowledge embedded in text into MRC (Shen et al., 2020; Dhingra et al., 2020). However, the linking information needed for creating such triples is not necessarily prominent in documents other than Wikipedia. Some works segment the document content based on its semantic structures and rank the segments based on their relevance to the query (Yan et al., 2019; Lee et al., 2018; Wang et al., 2018; Zheng et al., 2020). Another thread of works, on hierarchical document encoding (Li et al., 2015; Yang et al., 2016; Guo et al., 2019), first obtains sentence-level representations and then encodes the document based on the sentence vectors. Those works do not directly apply to fine-grained answer extraction across sentences.
In many online documents, important information unfolds through the semantic relations of hierarchical structures, such as parent-child and sibling relations between different parts of the document. Figure 1 illustrates the difference between using a document with and without the structure information for an MRC task. For query U1, it is crucial to keep in mind that we are in the context of "How to appeal a TVB ticket" and "Online" while reading the passage under "You will need" to find the answer to the user query. However, conventional Transformers fail to capture such contextual information when the text is too long to fit in the maximum sequence length allowed.
In this work, we explore the utilization of document structure for the focused task of fine-grained Machine Reading Comprehension on documents. We propose a Transformer-based method that reads a document as tree slices; it jointly learns the relevance of paragraphs and spans, and then performs a cascaded inference to find the best answer span. Our work is intuitively inspired by how people read through documents based on structural cues such as titles and subtitles, and then focus on the relevant parts to search for an answer. We utilize the structural information naturally available in online documents to identify tree slices. Each slice corresponds to the nodes along a path from the root node to a lower-level child node, as illustrated by the right part of Figure 1. Thus, we are able to capture essential structural information for the inference that could fall outside of a conventional sliding window or text segment. Compared to approaches such as Longformer (Beltagy et al., 2020) or ETC (Ainslie et al., 2020), our approach can be directly applied to many existing pretrained models and has a small GPU memory footprint. RikiNet employs a dynamic paragraph dual-attention reader and a multi-level cascaded answer predictor, while our tree slices consider hierarchical structures above paragraphs, and our cascaded inference is in a beam search style rather than the greedy decoding style in RikiNet.
We evaluate on two datasets with structured documents: one obtained from Natural Questions (Kwiatkowski et al., 2019), which is based on Wikipedia articles, and one from Doc2Dial (Feng et al., 2020), which is based on web pages from several domains. Our proposed method is compared with several baselines and shows performance gains on both datasets. For example, our method achieves a 4% gain in F1 on Doc2Dial, which shows its superiority on a small-scale dataset spanning multiple domains.
Our contributions can be summarized as follows: (1) We propose a Transformer-based method that reads a document as a tree. It simultaneously identifies the relevance of paragraphs and finds the answer span via jointly trained models with cascaded inference.
(2) Our method can utilize common structures as seen in many web documents. It allows Transformer models to read more focused content while retaining deep context; thus it can handle long documents in an efficient way. (3) Our proposed method outperforms several competitive baseline methods on two kinds of MRC tasks with documents from varied domains.

Approach
We adopt a Transformer-based document-tree-slice encoder with joint learning and cascaded inference. Our approach is influenced by the pattern of human behavior during reading, which is to focus on a smaller portion at a time and to favor the more relevant parts while looking for an answer. This approach can also overcome the constraint on fixed-length input imposed by the common Transformer architecture (Vaswani et al., 2017). More importantly, it enables us to always include important structural context information during encoding.

Tree Slicing
To obtain the tree representation of a web page, we consider the different levels of HTML title tags as the main indicators of the hierarchical structures such as parent-child and siblings in Figure 1. More details are provided in Section 3.
Formally, we define an example in the dataset as (Q, D, s, e), where Q is a question, D is a document, and s and e denote the inclusive indices pointing to the start and end of the target answer span.
If one does not consider the structure information, D is treated as a sequence and sent to the Transformer encoder. For long documents, the sliding window approach is widely used to truncate D into m overlapping fragments D_1, ..., D_m, and (Q, D, s, e) is converted into m training instances accordingly. In our proposed approach, we encode a document by considering its structure along with its content. Given a document D, let k be the number of leaf nodes in its tree structure. Then (Q, D, s, e) is converted into k training instances (Q, [A_i; P_i], s_i, e_i), i = 1, ..., k,
where P_i is a leaf node, s_i and e_i are the mapped indices in P_i, and A_i denotes P_i's ancestor chain in the document tree of D. For example, the ancestor chain of the leaf passage under "You will need" in Figure 1 would be the list of {'How to appeal a TVB ticket conviction', 'Online', 'You will need'}. Intuitively, the tree slice approach ensures that the most relevant structural information, the ancestor chain, is always taken into account and attended to by the Transformer encoder, while this is unlikely to be guaranteed by sliding window truncation.
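As a minimal sketch (not the released code), the slicing can be implemented as a recursive traversal that pairs each leaf with its ancestor chain; the nested-dict document format here is a hypothetical stand-in for the parsed tree:

```python
def tree_slices(node, ancestors=()):
    """Yield one (ancestor_chain, leaf_text) slice per leaf of the document tree."""
    children = node.get("children", [])
    if not children:                      # leaf: paragraph / list / table content
        yield list(ancestors), node["text"]
    else:                                 # stem: (sub)section title
        for child in children:
            yield from tree_slices(child, ancestors + (node["title"],))

# Hypothetical parsed tree for the document segment of Figure 1.
doc = {
    "title": "How to appeal a TVB ticket conviction",
    "children": [
        {"title": "Online",
         "children": [
             {"title": "You will need",
              "children": [{"text": "A valid email address ..."}]},
         ]},
    ],
}

for chain, text in tree_slices(doc):
    print(chain, "->", text)
# ['How to appeal a TVB ticket conviction', 'Online', 'You will need'] -> A valid email address ...
```

Each slice is then concatenated with the question Q to form one training instance, so the ancestor titles are always inside the encoder's input window.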

Joint Model with Cascaded Inference
With the tree slicing approach, each document yields more paragraphs from which to select the answer span than in the sliding window case. In order to teach the model to favor candidates from the more relevant parts of the document, we train a joint model that simultaneously learns to identify the relevance of paragraphs and to find the answer span. We then perform a cascaded inference to first find the most relevant paragraphs and then the best answer span within them, based on the scores from the joint model, as Figure 2 shows.

Joint model
The encoded representation of an instance C can be used to perform two tasks, each handled by a separate module: 1) the pooler layer and the matching layer (both linear layers) predict how likely a paragraph P is to contain the answer; 2) the span selection layer (another linear layer) identifies the answer span from P. Each training instance is converted to (C, s, e, g), where g ∈ {0, 1} denotes whether P contains the answer. We define the loss function to be Loss(g, s, e, C; θ) = L_CE(f_hit(g, C; θ)) + λ * (L_CE(f_start(s, C; θ)) + L_CE(f_end(e, C; θ))), where L_CE is the Cross Entropy loss function, θ denotes the model parameters, and each f is the score obtained by the corresponding linear layer on top of the last-layer representation of the Transformer encoder: f_hit by the pooler layer and the matching layer, and f_start and f_end by the span selection layer.
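For illustration, the loss can be sketched over raw logits with plain cross entropy (a simplified stand-in for the PyTorch implementation; batching and tensor shapes are omitted):

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def joint_loss(hit_logits, start_logits, end_logits, g, s, e, lam=0.5):
    """Loss = L_CE(f_hit, g) + lam * (L_CE(f_start, s) + L_CE(f_end, e))."""
    return (cross_entropy(hit_logits, g)
            + lam * (cross_entropy(start_logits, s) + cross_entropy(end_logits, e)))
```

As a sanity check, with uniform logits each cross-entropy term reduces to the log of the number of candidates (log 2 for the binary hit prediction, log n for an n-token span boundary).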
Cascaded inference After the two modules of the model are jointly trained, we conduct a cascaded inference in a beam search style.
• First, from all the instances corresponding to the tree slices of a single document, we select the top n instances ranked by f_hit(g = 1, C; θ). This is important for filtering out highly scored spans from irrelevant tree slices.
• Next, for each selected instance, every candidate span is scored as Score(C, s, e) = f_start(s, C; θ) + f_end(e, C; θ) + γ * f_hit(g = 1, C; θ), where γ weights the relevance of the tree slice.
• Finally, we choose the document span with the highest Score(C, s, e) as the answer.
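The two-step inference can be sketched as follows (the dict fields `id`, `hit` and `spans` are hypothetical names, and combining f_start, f_end and the γ-weighted relevance score is our reading of the span score):

```python
def cascaded_inference(instances, n, gamma=1.0):
    """Two-step inference over all tree-slice instances of one document."""
    # Step 1: keep only the top-n slices ranked by the relevance score f_hit.
    top = sorted(instances, key=lambda inst: inst["hit"], reverse=True)[:n]
    # Step 2: score every candidate span in the surviving slices and pick the best.
    best, best_score = None, float("-inf")
    for inst in top:
        for s, e, f_start, f_end in inst["spans"]:
            score = f_start + f_end + gamma * inst["hit"]
            if score > best_score:
                best, best_score = (inst["id"], s, e), score
    return best
```

Note that a high-scoring span in a low-relevance slice is never considered when its slice falls outside the top n, which is exactly the filtering effect described above.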
Given a document with tree slices, we create more instances than the sliding window approach does. However, with joint training and cascaded inference, our model reaches better accuracy in less training time, as will be shown in Section 4.

Data
Our focused task is utilizing document structure in contextual representations for fine-grained MRC. Since very few prior MRC datasets provide document structure information, we identified two public datasets where HTML markup tags are available in the document data together with QA pairs, and extracted tree structures from the HTML documents for MRC. Data scripts can be found at http://html2struct.github.io.
Extract Tree Structure To obtain the tree representation of the documents from the two datasets, we first parse the HTML files to get the markup tags of the textual content elements, which correspond to titles, lists, tables and paragraphs. We consider the different levels of title tags as the main indicators of hierarchical structures such as parent-child and sibling relations. Thus, the stem nodes are inherently section or subsection titles of the article, and the leaf nodes are typically paragraphs, list content or table content. We assign the article title as the tree root. Please refer to Appendix A for more details about the data statistics for the experiments.
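A minimal sketch of this parsing step using Python's standard `html.parser` (the released data scripts may differ; this toy version only handles heading tags and plain text):

```python
from html.parser import HTMLParser

class TreeBuilder(HTMLParser):
    """Build a title tree: h1..h6 tags open stem nodes, other text becomes leaf content."""
    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}

    def __init__(self):
        super().__init__()
        self.root = {"title": None, "level": 0, "children": [], "text": []}
        self.stack = [self.root]      # open sections, outermost first
        self.in_heading = None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            level = self.HEADINGS[tag]
            # Close sections at the same or deeper level: equal-level titles are siblings.
            while self.stack[-1]["level"] >= level:
                self.stack.pop()
            node = {"title": "", "level": level, "children": [], "text": []}
            self.stack[-1]["children"].append(node)
            self.stack.append(node)
            self.in_heading = node

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self.in_heading = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.in_heading is not None:
            self.in_heading["title"] += data   # title of the current stem node
        else:
            self.stack[-1]["text"].append(data)  # leaf content of the current section
```

For example, feeding `<h1>Root</h1><p>intro</p><h2>Sec</h2><p>body</p>` yields a root-level section "Root" containing the text "intro" and a child section "Sec" containing "body".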
NQStruct Natural Questions (Kwiatkowski et al., 2019) provides QA pairs that are grounded in Wikipedia articles. The original task provides answers in two formats: a long answer, typically a paragraph or table, and a short answer, typically one or more entities. In our task, we focus on identifying the short answer given the whole document as the input, and do not use the long answer data. We observe a bias toward answers appearing in the first paragraph, which is significant enough to serve as a baseline (Kwiatkowski et al., 2019). Thus, we follow Geva and Berant (2018) in alleviating this bias by only considering questions where the short answer does not come from the first paragraph. As a result, we derive a subset of 48K examples from the roughly 100K examples with short answers in the training and dev sets.
D2DStruct Doc2Dial (Feng et al., 2020) provides document-grounded dialogues with annotations of dialogue scenes, which allow us to identify the question-answer pairs most related to our target task. Specifically, we combine each agent turn that responds to a user query, together with the previous dialogue context, into a question. The public dataset contains over 4.1K document-grounded dialogues based on about 450 documents from different domains, and we derive 9.3K QA pairs from it.

Experiments and Results
We compare our proposed method (TreeJC for short) with several baseline methods. Next we describe the baselines and the experiment settings, and present the evaluation results.

Baselines
Sliding Window (SW) is a popular question answering baseline that trains a span selection model with a Transformer encoding document chunks, as described in Section 2.
Longformer is a Transformer model that handles long documents (Beltagy et al., 2020). We experiment with the sliding window approach above using the Longformer-base pretrained model, with a max sequence length of 4096 and a stride of 3072.
IR+SW is a pipeline approach that first identifies a small number k of candidate paragraphs (k = 10 in the experiments here) via the information retrieval method BM25 (Robertson et al., 1995), and then applies the SW approach. We consider it a solution with reduced time complexity compared to the traditional SW approach.
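As an illustration of this retrieval step, a compact BM25 ranker over tokenized paragraphs (a generic implementation with the common defaults k1 = 1.5 and b = 0.75, not necessarily the exact configuration used in the experiments):

```python
import math
from collections import Counter

def bm25_rank(query, paragraphs, k1=1.5, b=0.75, k=10):
    """Return the indices of the top-k paragraphs (token lists) for a query (token list)."""
    N = len(paragraphs)
    avgdl = sum(len(p) for p in paragraphs) / N
    df = Counter()                        # document frequency per term
    for p in paragraphs:
        df.update(set(p))

    def score(p):
        tf = Counter(p)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(p) / avgdl))
        return s

    return sorted(range(N), key=lambda i: score(paragraphs[i]), reverse=True)[:k]
```

The top-k paragraphs are then fed to the SW reader, so the reader's cost no longer grows with the full document length.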
LeafJC For the ablation study, we experiment with a variant of the TreeJC approach that excludes ancestors during encoding. The other implementation and experimental details are the same as for TreeJC.

Experiment Settings
All models are implemented in PyTorch. The pretrained models are Roberta-base for SW, IR+SW, LeafJC and TreeJC, and Longformer-base for Longformer. The implementations of SW, Longformer and IR+SW are adapted from the SQuAD example code 1 in HuggingFace Transformers (Wolf et al., 2019). For a fair comparison, the input to SW, Longformer and IR+SW is the flattened equivalent of the tree input to LeafJC and TreeJC and does not include the HTML tags of the web pages. All experiments were done on a single V100 GPU. In order to encode long sequences, Longformer requires much larger GPU memory: only 2 instances fit in one V100 GPU, whereas 27 instances fit with Roberta. Please see Appendix B for more details about the experiment setup.
For training our approach TreeJC, positive instances are up-sampled to reach a balanced proportion of positive and negative training instances. To avoid the consequent bias towards longer documents, the loss from each example (a document QA pair) is scaled down by the number of training instances derived from that example. λ and γ in Section 2 are set to 0.5 and 1, respectively.
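The two rebalancing steps can be sketched as follows (the instance field `g` and the integer up-sampling ratio are simplifying assumptions):

```python
from collections import Counter

def upsample_positives(instances):
    """Repeat positive instances (g == 1) so positives roughly balance negatives."""
    counts = Counter(inst["g"] for inst in instances)
    if counts[1] == 0 or counts[1] >= counts[0]:
        return list(instances)
    ratio = counts[0] // counts[1]        # integer up-sampling factor
    out = []
    for inst in instances:
        out.extend([inst] * (ratio if inst["g"] == 1 else 1))
    return out

def scaled_losses(example_instances, loss_fn):
    """Scale each instance loss down by the number of instances in its example,
    so that longer documents do not dominate the overall loss."""
    return [loss_fn(inst) / len(example_instances) for inst in example_instances]
```

With this scaling, every document QA pair contributes the same total weight to the loss regardless of how many tree slices it produces.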

Results
For evaluation, we use the exact match score and token-level F1 score (Rajpurkar et al., 2018). Table 1 and Table 2 present the evaluation results on the test sets of D2DStruct and NQStruct respectively, along with the training time. All numbers are in the form mean ± std, computed over three runs with different random seeds.
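For reference, a simplified sketch of these SQuAD-style metrics (the official evaluation script differs in details such as the order of normalization steps):

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    common = sum((Counter(p) & Counter(g)).values())   # overlapping token count
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction sharing half of its tokens with the gold answer gets a token-level F1 of 0.5, while exact match only credits predictions that are identical after normalization.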
We observe consistent performance gains by TreeJC over almost all baselines. TreeJC shows a significant improvement over SW, which indicates the effectiveness of encoding the structure information with our joint model and cascaded inference. LeafJC performs better than SW but worse than TreeJC, which confirms the importance of including ancestor nodes during encoding. Longformer 2 serves as a competitive baseline and achieves half a point higher F1 on the D2DStruct dataset, however at the cost of much longer training time. The IR+SW method, on the other hand, shows high efficiency but lower effectiveness, attributable to the fact that the IR step only achieves around 73% recall. To further examine how our approach performs on documents of different sizes, we break down the results on the NQStruct dataset and compare the performances in Table 3. The results show that our approach has a clear gain over SW on all document lengths, especially on very long documents.

Conclusion
We introduce a new Transformer-based method with joint learning and cascaded inference, inspired by the tree structures of documents, for machine reading comprehension. It outperforms several competitive baselines on two datasets from multiple domains. In particular, our study demonstrates that the proposed model is effective at encoding longer documents with deep contexts for MRC tasks.

A Data Statistics
The data statistics are shown in Table 4. The distribution of document length is shown in Table 5.

D2DStruct
The public release of Doc2Dial only provides train and dev sets 3 . To filter out non-answer agent turns, we remove the cases where the agent turn is grounded in (sub)section titles. We combine the two sets and then create train/dev/test splits of 70%, 15% and 15%, where 50% of the dev/test sets are from documents unseen in the training set.

B Experiment Settings
The deep learning systems are implemented in PyTorch and use the Transformer encoders from HuggingFace Transformers. We use the Roberta-base pretrained model and a max sequence length of 512 unless otherwise stated. All experiments were done with fp16 on a single V100 GPU. Table 8 presents the configurations and hyper-parameters that are shared by both datasets. For evaluation, we use the SQuAD 2.0 evaluation script.