DAGN: Discourse-Aware Graph Network for Logical Reasoning

Recent QA tasks with logical reasoning questions require understanding passage-level relations among sentences. However, current approaches still focus on sentence-level relations among tokens. In this work, we explore aggregating passage-level clues for solving logical reasoning QA by using discourse-based information. We propose a discourse-aware graph network (DAGN) that reasons relying on the discourse structure of the texts. The model encodes discourse information as a graph with elementary discourse units (EDUs) and discourse relations, and learns the discourse-aware features via a graph network for downstream QA tasks. Experiments are conducted on two logical reasoning QA datasets, ReClor and LogiQA, and our proposed DAGN achieves competitive results. The source code is available at https://github.com/Eleanor-H/DAGN.


Introduction
A variety of QA datasets have promoted the development of reading comprehension, for instance, SQuAD (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018), DROP (Dua et al., 2019), and so on. Recently, QA datasets requiring more complicated reasoning, i.e., logical reasoning, have also been introduced, such as ReClor (Yu et al., 2020) and LogiQA. The logical questions are taken from standardized exams such as the GMAT and LSAT, and require QA models to read complicated argument passages and identify the logical relationships therein, for example, selecting a correct assumption that supports an argument, or finding a claim that weakens an argument in a passage. Such logical reasoning is beyond the capability of most previous QA models, which focus on reasoning with entities or numerical keywords.
A main challenge for QA models is to uncover the logical structures underlying passages, such as identifying claims or hypotheses, or pointing out flaws in arguments. To achieve this, a QA model should first be aware of logical units, which can be sentences, clauses, or other meaningful text spans, and then identify the logical relationships between those units. However, logical structures are usually hidden and difficult to extract, and most datasets do not provide logical structure annotations.
An intuitive idea for unwrapping such logical information is to use discourse relations. For instance, as a conjunction, "because" indicates a causal relationship, whereas "if" indicates a hypothetical relationship. However, such discourse-based information is seldom considered in logical reasoning tasks: modeling of logical structures is still lacking, and currently published methods rely on contextual pre-trained models (Yu et al., 2020). Besides, previous graph-based methods (Ran et al., 2019; Chen et al., 2020a) that construct entity-based graphs are not suitable for logical reasoning tasks because the reasoning units differ.
In this paper, we propose a new approach to solving logical reasoning QA tasks by incorporating discourse-based information. First, we construct discourse structures: we use discourse relations from the Penn Discourse TreeBank 2.0 (PDTB 2.0) as delimiters to split texts into elementary discourse units (EDUs), and build a logic graph in which EDUs are nodes and discourse relations are edges. Then, we propose a Discourse-Aware Graph Network (DAGN) that learns high-level discourse features to represent passages. The discourse features are incorporated with the contextual token features from pre-trained language models. With the enhanced features, DAGN predicts answers to logical questions. Our experiments show that DAGN surpasses current publicly available methods on two recent logical reasoning QA datasets, ReClor and LogiQA.

[Figure 1: The architecture of our proposed method with an example. Context: "Theoretically, analog systems are superior to digital systems. A signal in a pure analog system can be infinitely detailed, while digital systems cannot produce signals that are more precise than their digital units. With this theoretical advantage there is a practical disadvantage. Since there is no limit on the potential detail of the signal, the duplication of an analog representation allows tiny variations from the original, which are errors." Question: "The statements above, if true, most strongly support which one of the following?" Option: "Digital systems are the best information systems because error cannot occur in the emission of digital signals."]

Our main contributions are three-fold: • We propose to construct logic graphs from texts by using discourse relations as edges and elementary discourse units as nodes.
• We obtain discourse features via graph neural networks to facilitate logical reasoning in QA models.
• We show the effectiveness of using logic graph and feature enhancement by noticeable improvements on two datasets, ReClor and LogiQA.

Method
Our intuition is to explicitly use discourse-based information to mimic the human reasoning process for logical reasoning questions. The questions are in multiple-choice format: given a triplet (context, question, answer options), a model answers the question by selecting the correct answer option. Our framework is shown in Figure 1. We first construct a discourse-based logic graph from the raw text. Then we conduct reasoning via graph networks to learn and update the discourse-based features, which are incorporated with the contextual token embeddings for downstream answer prediction.

Graph Construction
Our discourse-based logic graph is constructed via two steps: delimiting text into elementary discourse units (EDUs) and forming the graph using their relations as edges, as illustrated in Figure 1(1).

Discourse Units Delimitation
Prior studies show that clause-like text spans delimited by discourse relations can serve as discourse units that reveal the rhetorical structure of texts (Mann and Thompson, 1988). We further observe that such discourse units are essential units in logical reasoning, for example as assumptions or opinions. In the example shown in Figure 1, the "while" in the context indicates a comparison between the attributes of the "pure analog system" and those of "digital systems". The "because" in the option provides the evidence "error cannot occur in the emission of digital signals" for the claim "digital systems are the best information systems". We use PDTB 2.0 to help draw discourse relations. PDTB 2.0 contains discourse relations manually annotated on the one-million-word Wall Street Journal (WSJ) corpus, broadly characterized into "Explicit" and "Implicit" connectives. The former appear explicitly in sentences, such as the discourse adverbial "instead" or the subordinating conjunction "because", whereas the latter are inferred by annotators between successive pairs of text spans split by punctuation marks such as "." or ";". We simply take all the "Explicit" connectives as well as common punctuation marks to form our discourse delimiter library (details are given in Appendix A), with which we delimit the texts into EDUs. For each data sample, we segment the context and options, ignoring the question since it usually does not carry logical content.
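As an illustration, the delimitation step can be sketched as follows on the option sentence from Figure 1. The short connective and punctuation lists below are hypothetical stand-ins for the full delimiter library in Appendix A:

```python
import re

# Tiny illustrative delimiter library. The actual library takes all PDTB 2.0
# "Explicit" connectives plus common punctuation marks (Appendix A); these
# short lists are stand-ins for illustration only.
EXPLICIT_CONNECTIVES = ["because", "while", "if", "instead", "since", "although"]
PUNCTUATION = [".", ";", ","]

def delimit_edus(text):
    """Split text into elementary discourse units (EDUs) at connectives and
    punctuation marks, keeping each delimiter as its own token."""
    pattern = "(" + "|".join(
        [re.escape(p) for p in PUNCTUATION]
        + [r"\b%s\b" % c for c in EXPLICIT_CONNECTIVES]
    ) + ")"
    parts = [p.strip() for p in re.split(pattern, text, flags=re.IGNORECASE)]
    return [p for p in parts if p]  # drop empty fragments

option = ("Digital systems are the best information systems because "
          "error cannot occur in the emission of digital signals.")
print(delimit_edus(option))
# -> ['Digital systems are the best information systems', 'because',
#     'error cannot occur in the emission of digital signals', '.']
```

Keeping the delimiters in the output is deliberate: they become the typed edges in the graph construction step.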

Discourse Graph Construction
We define the discourse-based graphs with EDUs as nodes, and the "Explicit" connectives as well as the punctuation marks as two types of edges. We assume that each connective or punctuation mark connects the EDUs before and after it. For example, the option sentence in Figure 1 is delimited into two EDUs, EDU_7 = "digital systems are the best information systems" and EDU_8 = "error cannot occur in the emission of digital signals", by the connective r = "because". The returned triplets are then (EDU_7, r, EDU_8) and (EDU_8, r, EDU_7). For each data sample with a context and multiple answer options, we construct a separate graph for each option, using the EDUs of the shared context and of that single option. The graph for option k is denoted by G_k = (V_k, E_k).
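Given a delimited sequence of EDUs and delimiters, the bidirectional typed triplets can be collected as in this sketch (the delimiter sets are again hypothetical stand-ins for the full library):

```python
# Build bidirectional (EDU_i, relation, EDU_j) triplets from an alternating
# sequence of EDUs and delimiters, distinguishing the two edge types:
# explicit connectives vs. punctuation marks.
EXPLICIT_CONNECTIVES = {"because", "while", "if", "instead", "since"}
PUNCTUATION = {".", ";", ","}

def build_graph(parts):
    """parts: EDUs interleaved with delimiters (output of EDU delimitation).
    Returns the node list and the typed, bidirectional edge triplets."""
    nodes, edges = [], []
    prev_edu, pending_rel = None, None
    for p in parts:
        low = p.lower()
        if low in EXPLICIT_CONNECTIVES or p in PUNCTUATION:
            # remember the relation; it links the EDUs before and after it
            pending_rel = "explicit" if low in EXPLICIT_CONNECTIVES else "punct"
        else:
            nodes.append(p)
            if prev_edu is not None and pending_rel is not None:
                edges.append((prev_edu, pending_rel, p))
                edges.append((p, pending_rel, prev_edu))  # both directions
            prev_edu, pending_rel = p, None
    return nodes, edges

parts = ["digital systems are the best information systems", "because",
         "error cannot occur in the emission of digital signals", "."]
nodes, edges = build_graph(parts)
print(len(nodes), len(edges))  # 2 nodes, 2 directed edges
```

For a full data sample, this routine would be run once per answer option over the concatenated context-plus-option EDU sequence, yielding one graph G_k per option.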

Discourse-Aware Graph Network
We present the Discourse-Aware Graph Network (DAGN) that uses the constructed graph to exploit discourse-based information for answering logical questions. It consists of three main components: an EDU encoding module, a graph reasoning module, and an answer prediction module. The former two are demonstrated in Figure 1(2), whereas the final component is in Figure 1(3).
EDU Encoding An EDU span embedding is obtained from its token embeddings in two steps. First, similar to previous works (Yu et al., 2020), we encode the input sequence "<s> context </s> question || option </s>" into contextual token embeddings with pre-trained language models, where <s> and </s> are the special tokens of the RoBERTa model, and || denotes concatenation. Second, given the token embedding sequence {t_1, t_2, ..., t_L}, the n-th EDU embedding is obtained by summing its token embeddings, e_n = \sum_{l \in S_n} t_l, where S_n is the set of token indices belonging to the n-th EDU.
Graph Reasoning After EDU encoding, DAGN performs reasoning over the discourse graph. Inspired by previous graph-based models (Ran et al., 2019; Chen et al., 2020a), we also learn graph node representations to obtain higher-level features. However, we consider different graph construction and encoding. Specifically, let G_k = (V_k, E_k) denote the graph corresponding to the k-th answer option, with the EDU embedding e_n as the initial embedding v_n of the n-th node. For each node v_i ∈ V_k, N_i denotes the set of neighbors of node v_i, and W_{r_{ji}} is the relation-specific weight matrix for one of the two edge types, where r^E indicates graph edges corresponding to the explicit connectives and r^I indicates graph edges corresponding to punctuation marks.
The model first calculates a weight α_i for each node with a linear transformation and a sigmoid function, α_i = σ(W_α v_i + b_α), then propagates messages over the graph, ṽ_i = \sum_{v_j \in N_i} (1/|N_i|) α_j W_{r_{ji}} v_j, where ṽ_i is the message representation of node v_i, and α_j and v_j are the weight and the node embedding of v_j respectively. After the message propagation, the node representations are updated with the initial node embeddings and the message representations by v'_i = ReLU(W_u (v_i + ṽ_i) + b_u), where W_u and b_u are weight and bias respectively. The updated node representations v'_i are then used to enhance the contextual token embeddings via summation at the corresponding positions, i.e., t'_l = t_l + v'_n, where l ∈ S_n and S_n is the token index set of the n-th EDU.
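One reasoning step described above (sigmoid node weights, relation-specific message passing, residual-style update) can be sketched in NumPy as follows. The tensor shapes, the random initialization, and the exact update form are illustrative assumptions rather than the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_reasoning_step(V, neighbors, edge_type, W_rel, W_alpha, b_alpha, W_u, b_u):
    """One message-passing step over an option graph.

    V:         (n, d) node (EDU) embeddings
    neighbors: neighbors[i] = list of neighbor indices of node i
    edge_type: edge_type[(j, i)] in {"explicit", "punct"}
    W_rel:     dict of (d, d) relation-specific weight matrices
    """
    n, d = V.shape
    alpha = sigmoid(V @ W_alpha + b_alpha)  # one scalar weight per node
    V_new = V.copy()
    for i in range(n):
        if not neighbors[i]:
            continue
        msg = np.zeros(d)
        for j in neighbors[i]:
            # neighbor message, scaled by its weight and edge-type matrix
            msg += alpha[j] * (W_rel[edge_type[(j, i)]] @ V[j])
        msg /= len(neighbors[i])            # average over neighbors
        V_new[i] = np.maximum(0.0, W_u @ (V[i] + msg) + b_u)  # ReLU update
    return V_new

d = 8
V = rng.normal(size=(3, d))                 # three EDU nodes
neighbors = {0: [1], 1: [0, 2], 2: [1]}
edge_type = {(0, 1): "explicit", (1, 0): "explicit",
             (1, 2): "punct", (2, 1): "punct"}
W_rel = {"explicit": rng.normal(size=(d, d)), "punct": rng.normal(size=(d, d))}
W_alpha, b_alpha = rng.normal(size=d), 0.0
W_u, b_u = rng.normal(size=(d, d)), np.zeros(d)

V_updated = graph_reasoning_step(V, neighbors, edge_type, W_rel,
                                 W_alpha, b_alpha, W_u, b_u)
print(V_updated.shape)  # (3, 8)
```

In DAGN this step is iterated a small number of times (2 for ReClor and 3 for LogiQA, per Appendix B), and the final node representations are added back to the token embeddings of their EDU spans.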
Answer Prediction The probabilities of the options are obtained by feeding the discourse-enhanced token embeddings into the answer prediction module; the whole model is trained end-to-end with cross-entropy loss. Specifically, the embedding sequence first goes through layer normalization (Ba et al., 2016), then a bidirectional GRU (Cho et al., 2014). The output embeddings are added to the input ones through a residual connection (He et al., 2016), and we obtain the encoded sequence after another layer normalization.
We then merge the high-level discourse features and the low-level token features. Specifically, the variable-length encoded context sequence and question-and-option sequence are each pooled via weighted summation, wherein the weights are the softmax results of a linear transformation of the sequence. The pooled vectors are concatenated and fed into a two-layer perceptron (described in Appendix B) to obtain the score for each option.
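The weighted-summation pooling can be sketched as follows; the scoring vector standing in for the learned linear transformation, and the small dimensions, are assumptions for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attentive_pool(H, w):
    """Pool a variable-length (L, d) sequence into one (d,) vector:
    score each position with a linear transformation (vector w), normalize
    the scores with softmax, and take the weighted sum of the positions."""
    weights = softmax(H @ w)   # (L,) weights over positions, summing to 1
    return weights @ H         # (d,) pooled representation

rng = np.random.default_rng(0)
context = rng.normal(size=(12, 4))    # encoded context sequence
qa_option = rng.normal(size=(7, 4))   # encoded question-and-option sequence
w = rng.normal(size=4)                # stand-in for the learned scoring layer

pooled = np.concatenate([attentive_pool(context, w),
                         attentive_pool(qa_option, w)])
print(pooled.shape)  # (8,) -- ready for the scoring perceptron
```

Because the weights sum to 1, sequences of different lengths are reduced to fixed-size vectors before concatenation and scoring.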

Experiments
We evaluate the performance of DAGN on two logical reasoning datasets, ReClor (Yu et al., 2020) and LogiQA, and conduct an ablation study on graph construction and the graph network. Implementation details are given in Appendix B.

Datasets
ReClor contains 6,138 questions modified from standardized tests such as GMAT and LSAT, which are split into train / dev / test sets with 4,638 / 500 / 1,000 samples respectively.

Results
The experimental results are shown in the results tables. Compared with RoBERTa-Large, the improvement on the HARD subset of ReClor is a remarkable 4.46%. This indicates that the incorporated discourse-based information compensates for a shortcoming of the baseline model, and that the discourse features are beneficial for such logical reasoning. Besides, DAGN and DAGN (Aug) also outperform the baseline models on LogiQA, notably with a 4.01% improvement over RoBERTa-Large on the test set.

Ablation Study
We conduct an ablation study on the graph construction details as well as the graph reasoning module. The results are reported in Table 3.
Varied Graph Nodes We first use clauses or sentences in substitution for EDUs as graph nodes. For clause nodes, we simply remove the "Explicit" connectives from the delimiter library, so that texts are delimited only by punctuation marks. For sentence nodes, we further reduce the delimiter library to the period (".") alone. With the modified graphs using clause nodes or the coarser sentence nodes, the accuracy of DAGN drops to 64.40%. This indicates that clause or sentence nodes carry less discourse information and serve poorly as logical reasoning units.

Varied Graph Edges
We make two changes to the edges: (1) modifying the edge type, and (2) modifying the edge linking. For the edge type, all edges are regarded as a single type. For the edge linking, we ignore discourse relations and connect every pair of nodes, making the graph fully connected. The resulting accuracies drop to 64.80% and 61.60% respectively. This demonstrates that in the graph we built, edges link EDUs in a reasonable manner that properly reflects the logical relations.
Ablation on Graph Reasoning We remove the graph module from DAGN for comparison. This model differs from the baseline only in the extra prediction module. Its performance on the ReClor dev set lies between the baseline model and DAGN. Therefore, although the prediction module benefits accuracy, removing graph reasoning discards the discourse features and degrades performance, demonstrating the necessity of discourse-based structure for logical reasoning.

Related Works
Recent datasets for reading comprehension tend to be more complicated and require models' capability of reasoning. For instance, HotpotQA (Yang et al., 2018), WikiHop (Welbl et al., 2018), OpenBookQA (Mihaylov et al., 2018), and MultiRC (Khashabi et al., 2018) require multi-hop reasoning. DROP (Dua et al., 2019) and MC-TACO need numerical reasoning. WIQA (Tandon et al., 2019) and CosmosQA (Huang et al., 2019) require causal reasoning, where models must understand counterfactual hypotheses or find cause-effect relationships in events. The logical reasoning datasets (Yu et al., 2020), in contrast, require models to uncover the inner logic of texts. Deep neural networks are used for reasoning-driven RC. Evidence-based methods (Madaan et al., 2020; Huang et al., 2020) generate explainable evidence from a given context as the backup of reasoning. Graph-based methods (De Cao et al., 2019; Cao et al., 2019; Ran et al., 2019; Chen et al., 2020b; Xu et al., 2020b) explicitly model the reasoning process with constructed graphs, then learn and update features through graph-based message passing. There are also other methods such as neuro-symbolic models (Saha et al., 2021) and adversarial training (Pereira et al., 2020). Our paper uses a graph-based model; however, for uncovering logical relations, the graph nodes and edges are customized with discourse information.
Discourse information provides a high-level understanding of texts and hence is beneficial for many natural language tasks, for instance, text summarization (Cohan et al., 2018; Joty et al., 2019; Xu et al., 2020a; Feng et al., 2020), neural machine translation (Voita et al., 2018), and coherent text generation (Bosselut et al., 2018). There are also discourse-based applications for reading comprehension. DISCERN (Gao et al., 2020) segments texts into EDUs and learns interactive EDU features. Mihaylov and Frank (2019) provide additional discourse-based annotations and encode them with discourse-aware self-attention models. Unlike previous works, DAGN first uses discourse relations as graph edges connecting EDUs, then learns the discourse features via message passing with graph neural networks.

Conclusion
In this paper, we introduce a Discourse-Aware Graph Network (DAGN) to address logical reasoning QA tasks. We first treat elementary discourse units (EDUs), split by discourse relations, as basic reasoning units. We then build discourse-based logic graphs with EDUs as nodes and discourse relations as edges. DAGN learns the discourse-based features and uses them to enhance the contextual token embeddings. DAGN reaches competitive performance on two recent logical reasoning datasets, ReClor and LogiQA.

B Implementation Details
We fine-tune RoBERTa-Large as the backbone pre-trained language model for DAGN, which contains 24 hidden layers with hidden size 1024. The overall model is trained end-to-end and updated by the Adam (Kingma and Ba, 2015) optimizer with an overall learning rate of 5e-6 and a weight decay of 0.01. The overall dropout rate is 0.1, and the maximum sequence length is 256. We tune the model on the dev set to obtain the best number of graph reasoning iterations, which is 2 for ReClor and 3 for LogiQA. The model is trained for 10 epochs with a batch size of 16 on an Nvidia Tesla V100 GPU. For the answer prediction module, the hidden size of the GRU is the same as that of the token embeddings in the pre-trained language model, i.e., 1024. The two-layer perceptron first projects the concatenated vectors from a hidden size of 1024 × 3 to 1024, then projects 1024 to 1.