Scalable Multi-Hop Relational Reasoning for Knowledge-Aware Question Answering

While fine-tuning pre-trained language models (PTLMs) has yielded strong results on a range of question answering (QA) benchmarks, these methods still struggle when external knowledge is needed to infer the right answer. Existing work on augmenting QA models with external knowledge (e.g., knowledge graphs) either struggles to model multi-hop relations efficiently, or lacks transparency into the model's prediction rationale. In this paper, we propose a novel knowledge-aware approach that equips PTLMs with a multi-hop relational reasoning module, named Multi-hop Graph Relation Networks (MHGRN). It performs multi-hop, multi-relational reasoning over subgraphs extracted from external knowledge graphs. The proposed reasoning module unifies path-based reasoning methods and graph neural networks to achieve better interpretability and scalability. We also empirically show its effectiveness and scalability on the CommonsenseQA and OpenbookQA datasets, and interpret its behaviors with case studies. In particular, MHGRN achieves state-of-the-art performance (76.5% accuracy) on the official CommonsenseQA test set.


Introduction
Many recently proposed question answering tasks require not only machine comprehension of the question and context, but also relational reasoning over entities (concepts) and their relationships by referencing external knowledge (Talmor et al., 2019; Sap et al., 2019; Clark et al., 2018). For example, the question in Fig. 1 requires a model to perform relational reasoning over the mentioned entities, i.e., to infer latent relations among the concepts {CHILD, SIT, DESK, SCHOOLROOM}. Background knowledge such as "a child is likely to appear in a schoolroom" may not be readily contained in the questions themselves, but is commonsensical to humans.

Figure 1: Illustration of knowledge-aware QA. A sample question from CommonsenseQA can be better answered if a relevant subgraph of ConceptNet is provided as evidence. Blue nodes correspond to entities mentioned in the question; pink nodes correspond to those in the answer. The other nodes are associated entities introduced when extracting the subgraph. ⋆ indicates the correct answer.
Despite the success of large-scale pre-trained language models (PTLMs) (Devlin et al., 2019;Liu et al., 2019b), there is still a large performance gap between fine-tuned models and human performance on datasets that probe relational reasoning. These models also fall short of providing interpretable predictions as the knowledge in their pretraining corpus is not explicitly stated, but rather is implicitly learned. It is thus difficult to recover the knowledge used in the reasoning process.
This has led many works to leverage knowledge graphs to improve machine reasoning capability for answering such questions (Mihaylov and Frank, 2018; Lin et al., 2019; Yang et al., 2019). Knowledge graphs represent relational knowledge between entities as multi-relational edges, making it easier for a model to acquire relational knowledge and improve its reasoning ability. Moreover, incorporating knowledge graphs brings the potential of interpretable and trustworthy predictions, as the knowledge is now explicitly stated. For example, in Fig. 1, the relational path (CHILD → AtLocation → CLASSROOM → Synonym → SCHOOLROOM) naturally provides evidence for the answer SCHOOLROOM.
A straightforward approach to leveraging a knowledge graph is to directly model these relational paths. KagNet (Lin et al., 2019) and MHPGM (Bauer et al., 2018) extract relational paths from the knowledge graph and encode them with sequence models, so that multi-hop relations are explicitly modeled. Applying attention mechanisms over these relational paths can further offer good interpretability. However, these models are hardly scalable, because the number of possible paths in a graph is (1) polynomial w.r.t. the number of nodes and (2) exponential w.r.t. the path length (see Fig. 2). Therefore, some works (Weissenborn et al., 2017; Mihaylov and Frank, 2018) resort to using only one-hop paths, namely triples, to balance scalability and reasoning capacity.
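The scalability concern can be made concrete with a small counting exercise. The sketch below (ours, not from the paper) bounds the number of k-hop relational paths in a fully connected multi-relational graph with self-loops, exhibiting the polynomial-in-n, exponential-in-k growth:

```python
def num_relational_paths(n: int, m: int, k: int) -> int:
    """Upper bound on distinct k-hop relational paths when every ordered
    node pair (including self-loops) is connected by all m relation types.
    A path fixes an ordered sequence of k+1 nodes and one relation per hop,
    giving n^(k+1) * m^k paths: polynomial in n, exponential in k."""
    return n ** (k + 1) * m ** k

# With 10 nodes and 5 relation types, each extra hop multiplies the
# path count by n * m = 50:
for k in range(1, 5):
    print(k, num_relational_paths(10, 5, k))
```

This is why path-enumeration models become impractical beyond short path lengths, motivating the one-hop (triple) compromise mentioned above.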
Graph neural networks (GNNs), in contrast, enjoy better scalability via their message passing formulation, but usually lack transparency. The most commonly used GNN variant, the Graph Convolutional Network (GCN) (Kipf and Welling, 2017), performs message passing by aggregating neighborhood information for each node, but ignores relation types. RGCN (Schlichtkrull et al., 2018) generalizes GCNs with relation-specific aggregation, making it applicable to encoding multi-relational graphs. However, these models do not distinguish the importance of different neighbors or relation types and thus cannot provide explicit relational paths to interpret model behavior.
In this paper, we propose a novel graph encoding architecture, Multi-hop Graph Relation Networks (MHGRN), which combines the strengths of path-based models and GNNs. Our model inherits scalability from GNNs by preserving the message passing formulation. It also enjoys the interpretability of path-based models by incorporating a structured relational attention mechanism to model message passing routes. Our key motivation is to perform multi-hop message passing within a single layer, allowing each node to directly attend to its multi-hop neighbours and thus enabling multi-hop relational reasoning with MHGRN. We outline the favorable features of knowledge-aware QA models in Table 1 and compare our MHGRN with representative GNNs and path-based methods.
We summarize the main contributions of this work as follows: 1) We propose MHGRN, a novel model architecture tailored to multi-hop relational reasoning. Our model is capable of explicitly modeling multi-hop relational paths at scale. 2) We propose a structured relational attention mechanism for efficient and interpretable modeling of multi-hop reasoning paths, along with its training and inference algorithms. 3) We conduct extensive experiments on two question answering datasets and show that our models bring significant improvements compared to knowledge-agnostic pre-trained language models, and outperform other graph encoding methods by a large margin.

Problem Formulation and Overview
In this paper, we limit the scope to the task of multiple-choice question answering, although the method can be easily generalized to other knowledge-guided tasks (e.g., natural language inference). The overall paradigm of knowledge-aware QA is illustrated in Fig. 3. Formally, given an external knowledge graph (KG) as the knowledge source and a question q, our goal is to identify the correct answer from a set C of given choices. We turn this problem into measuring the plausibility score between q and each answer choice a ∈ C, and then selecting the answer with the highest plausibility score.

Figure 3: Overview of the knowledge-aware QA framework. It integrates the output from the graph encoder (for relational reasoning over contextual subgraphs) and the text encoder (for textual understanding) to generate the plausibility score for an answer option.
We use q and a to denote the representation vectors of question q and option a. To measure the score for q and a, we first concatenate them to form a statement s = [q; a]. Then we extract from the external KG a subgraph G (i.e., schema graph in KagNet (Lin et al., 2019)), with the guidance of s (detailed in §5.1). This contextualized subgraph is defined as a multi-relational graph G = (V, E, φ).
Here V is a subset of the entity nodes in the external KG, containing only entities relevant to s. E ⊆ V × R × V is the set of edges that connect nodes in V, where R = {1, ⋯, m} is the set of ids of all pre-defined relation types. The mapping function φ ∶ V → T = {E_q, E_a, E_o} takes node i ∈ V as input and outputs E_q if i is an entity mentioned in q, E_a if it is mentioned in a, and E_o otherwise. We finally encode the statement into a vector s and G into a vector g, and concatenate s and g to calculate the plausibility score.
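To make the formulation concrete, here is a minimal sketch (our own illustrative names, not the authors' code) of the contextualized subgraph G = (V, E, φ) as a plain data structure, instantiated with the Fig. 1 example:

```python
from dataclasses import dataclass
from enum import Enum

class NodeType(Enum):
    QUESTION = "E_q"   # entity mentioned in the question q
    ANSWER = "E_a"     # entity mentioned in the answer option a
    OTHER = "E_o"      # entity introduced during subgraph extraction

@dataclass
class SchemaGraph:
    """Contextualized subgraph G = (V, E, phi) with m relation types."""
    nodes: set         # V: entity ids relevant to the statement s
    edges: set         # E: (head, relation_id, tail) triples, relation_id in {1..m}
    node_type: dict    # phi: node id -> NodeType

# Toy version of the Fig. 1 example (relation ids are placeholders):
AT_LOCATION, SYNONYM = 1, 2
g = SchemaGraph(
    nodes={"child", "classroom", "schoolroom"},
    edges={("child", AT_LOCATION, "classroom"),
           ("classroom", SYNONYM, "schoolroom")},
    node_type={"child": NodeType.QUESTION,
               "classroom": NodeType.OTHER,
               "schoolroom": NodeType.ANSWER},
)
```

The path child → AtLocation → classroom → Synonym → schoolroom from Fig. 1 is then a two-hop chain of edges in this structure.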

Background: Multi-Relational Graph Encoding Methods
We leave the encoding of s to pre-trained language models, which have shown powerful text representation ability, and focus on the challenge of encoding the graph G to capture latent relations between entities. Current methods for encoding multi-relational graphs mainly fall into two categories: graph neural networks and path-based models. Graph neural networks encode structured information by passing messages between nodes, directly operating on the graph structure, while path-based methods first decompose the graph into paths and then pool features over all the paths.

Graph Encoding with GNNs. For a graph with n nodes, a graph neural network (GNN) takes a set of node features {h_1, h_2, ..., h_n} as input and computes their corresponding node embeddings {h'_1, h'_2, ..., h'_n} via message passing (Gilmer et al., 2017). A compact graph representation for G can thus be obtained by pooling over the node embeddings {h'_i}:

    g = Pool({h'_1, h'_2, ..., h'_n}).    (1)

As a notable variant of GNNs, graph convolutional networks (GCNs) (Kipf and Welling, 2017) update node embeddings by aggregating messages from their direct neighbors. RGCNs (Schlichtkrull et al., 2018) extend GCNs to encode multi-relational graphs by defining a relation-specific weight matrix W_r for each edge type:

    h'_i = σ( ∑_{r∈R} ∑_{j∈N_i^r} 1/|N_i^r| · W_r h_j ),    (2)

where N_i^r denotes the neighbors of node i under relation r.
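For concreteness, the RGCN-style update of Eq. 2 can be sketched in NumPy as follows; the dense edge-list loop and function names are our own illustration, not the original implementation:

```python
import numpy as np

def rgcn_layer(h, edges, W):
    """One RGCN message-passing step (cf. Eq. 2): for each node i and
    relation r, average W_r @ h_j over neighbors j with (j, r, i) in E,
    then apply a ReLU non-linearity.
    h: (n, d) node features; edges: iterable of (j, r, i); W: (m, d, d)."""
    out = np.zeros_like(h)
    for i in range(h.shape[0]):
        for r in range(W.shape[0]):
            nbrs = [j for (j, rr, ii) in edges if ii == i and rr == r]
            if nbrs:  # 1/|N_i^r| normalization over relation-r neighbors
                out[i] += sum(W[r] @ h[j] for j in nbrs) / len(nbrs)
    return np.maximum(out, 0.0)
```

A production implementation would vectorize this with sparse adjacency matrices; the loop form simply mirrors the double sum in Eq. 2.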
While GNNs have proved to have good scalability, their reasoning is done at the node level, making them incompatible with modeling path-level reasoning chains, a crucial component for QA tasks that require relational reasoning. This property also hinders the model's decisions from being interpreted at the path level.

Graph Encoding with Path-Based Models. In addition to directly modeling the graph with GNNs, one can also view a graph as a set of relational paths connecting pairs of entities.
Relation Networks (RNs) (Santoro et al., 2017) can be adapted to multi-relational graph encoding under QA settings. RNs use MLPs to encode all triples (one-hop paths) in G whose head entity is a question entity and whose tail entity is an answer entity. They then pool the triple embeddings to generate a vector for G as follows:

    g = Pool({MLP(h_j ⊕ e_r ⊕ h_i) ∣ (j, r, i) ∈ E, φ(j) = E_q, φ(i) = E_a}).    (3)
Here h_j and h_i are the features for nodes j and i, e_r is the embedding of relation r ∈ R, and ⊕ denotes vector concatenation. To further equip RNs with the ability to model nondegenerate paths, KagNet (Lin et al., 2019) adopts LSTMs to encode all paths connecting question entities and answer entities with lengths no more than K. It then aggregates all path embeddings via an attention mechanism:

    g = Pool({LSTM(j, r_1, ..., r_k, i) ∣ (j, r_1, ..., r_k, i) is a path in G with φ(j) = E_q, φ(i) = E_a, k ≤ K}).    (4)
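A minimal sketch of the RN-style encoding in Eq. 3 (our own simplification; the MLP and pooling function are left as injectable placeholders):

```python
import numpy as np

def rn_encode(edges, h, e, node_type, mlp,
              pool=lambda X: np.mean(X, axis=0)):
    """Relation Network-style graph vector (cf. Eq. 3): encode every
    question-entity -> answer-entity triple with an MLP applied to the
    concatenation [h_j ; e_r ; h_i], then pool the triple embeddings.
    h: (n, d) node features; e: (m, d_r) relation embeddings."""
    triples = [np.concatenate([h[j], e[r], h[i]])
               for (j, r, i) in edges
               if node_type[j] == "E_q" and node_type[i] == "E_a"]
    return pool(np.stack([mlp(t) for t in triples]))
```

With an identity "MLP" and a single triple, the graph vector is simply the concatenated triple features, which makes the one-hop nature of RNs easy to see.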

Proposed Method: Multi-Hop Graph Relation Networks (MHGRN)
This section presents Multi-hop Graph Relation Networks (MHGRN), a novel graph neural network architecture that unifies both GNNs and RNs for encoding multi-relational graphs to augment text comprehension. MHGRN inherits the capability of path reasoning and interpretability from path-based models, while preserving the good scalability of GNNs via the message passing formulation.

MHGRN: Model Architecture
Following the introduction to GNN in Sec. 3, we consider encoding a multi-relational graph G = (V, E, φ) into a fixed size vector g conditioned on textual representation s, by first transforming input node features {h 1 , . . . , h n } into node embeddings {h ′ 1 , . . . , h ′ n }, and then pooling these embeddings. Node features can be initialized with pre-trained weights (details in Appendix A) and we focus on the computation of node embeddings.
Type-Specific Transformation. To make our model aware of the node type information φ, we first perform a node-type-specific linear transformation on the input node features:

    x_i = U_{φ(i)} h_i + b_{φ(i)},    (5)

where the learnable parameters U_{φ(i)} and b_{φ(i)} are specific to the type of node i.
Multi-Hop Message Passing. As mentioned before, our motivation is to endow GNNs with the capability of directly modeling paths. To this end, we propose to pass messages directly over all the relational paths of lengths up to K, where K is a hyper-parameter. The set of valid k-hop relational paths is defined as:

    Φ_k = {(j, r_1, ..., r_k, i) ∣ (j, r_1, j_1), (j_1, r_2, j_2), ..., (j_{k−1}, r_k, i) ∈ E}.    (6)

We perform k-hop (1 ≤ k ≤ K) message passing over these paths, which is a generalization of the single-hop message passing in RGCN (see Eq. 2):

    z_i^k = ∑_{(j, r_1, ..., r_k, i) ∈ Φ_k} α(j, r_1, ..., r_k, i) / d_k · W_{r_k}^k ⋯ W_{r_1}^1 x_j,    (7)

where the W_r^t (1 ≤ t ≤ K, 0 ≤ r ≤ m) matrices are learnable, α(j, r_1, ..., r_k, i) is an attention score elaborated in §4.2, and d_k is the normalization factor. The matrix products {W_{r_k}^k ⋯ W_{r_1}^1 ∣ 1 ≤ r_1, ..., r_k ≤ m} can be interpreted as a low-rank approximation of an m×⋯×m (k times) × d × d tensor that assigns a separate transformation to each k-hop relation, where d is the dimension of x_i.

Incoming messages from paths of different lengths are aggregated via an attention mechanism (Vaswani et al., 2017):

    h_i = ∑_{k=1}^{K} softmax(Attn(s, z_i^k)) · z_i^k.    (8)

Non-linear Activation. Finally, we apply a shortcut connection and a non-linear activation to obtain the output node embeddings:
    h'_i = σ(V h_i + V′ x_i),    (9)

where V and V′ are learnable model parameters, and σ is a non-linear activation function.
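To make the multi-hop message passing of Eq. 7 concrete, here is a naive path-enumeration sketch with uniform attention (α ≡ 1); this is our own illustration, and it deliberately omits the attention scores and the hop-aggregation step:

```python
import numpy as np

def k_hop_paths(edges, k):
    """Enumerate Phi_k (cf. Eq. 6): tuples (j, r_1, ..., r_k, i)
    formed by chaining k edges."""
    paths = [(j, r, i) for (j, r, i) in edges]
    for _ in range(k - 1):
        paths = [p[:-1] + (r, i) for p in paths
                 for (j, r, i) in edges if j == p[-1]]
    return paths

def multi_hop_messages(x, edges, W, k):
    """Naive Eq. 7 with uniform attention: for each node i, average
    W^k_{r_k} ... W^1_{r_1} x_j over all k-hop paths ending at i.
    x: (n, d) transformed features; W: (K, m, d, d) hop- and
    relation-specific matrices."""
    z = np.zeros_like(x)
    counts = np.zeros(x.shape[0])
    for p in k_hop_paths(edges, k):
        j, rels, i = p[0], p[1:-1], p[-1]
        msg = x[j]
        for t, r in enumerate(rels):   # apply W^1_{r_1} first, W^k_{r_k} last
            msg = W[t, r] @ msg
        z[i] += msg
        counts[i] += 1
    nz = counts > 0
    z[nz] /= counts[nz, None]          # d_k normalization (uniform alpha)
    return z
```

This direct enumeration is exponential in k; §4.3 notes that the same quantity can be computed hop by hop in linear time.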

Structured Relational Attention
The remaining problem is how to effectively parameterize the attention score α(j, r_1, ..., r_k, i) in Eq. 7 for all possible k-hop paths without introducing O(m^k) parameters. We first regard it as the probability of a relation sequence (φ(j), r_1, ..., r_k, φ(i)) conditioned on s:

    α(j, r_1, ..., r_k, i) = Pr[(φ(j), r_1, ..., r_k, φ(i)) ∣ s],    (10)

which can naturally be modeled by a probabilistic graphical model, such as a conditional random field (Lafferty et al., 2001):

    Pr[(φ(j), r_1, ..., r_k, φ(i)) ∣ s] ∝ exp( f(φ(j)) + ∑_{t=1}^{k} δ(r_t) + ∑_{t=2}^{k} τ(r_{t−1}, r_t) + g(φ(i)) ) ≜ β(r_1, ..., r_k, s) · γ(φ(j), φ(i), s),    (11)

where f(⋅), δ(⋅) and g(⋅) are parameterized by two-layer MLPs and τ(⋅) by a transition matrix of shape m × m. Intuitively, β(⋅) models the importance of a k-hop relation while γ(⋅) models the importance of messages from node type φ(j) to node type φ(i) (e.g., the model can learn to pass messages only from question entities to answer entities).
Our model scores a k-hop relation by decomposing it into both context-aware single-hop relations (modeled by δ) and two-hop relations (modeled by τ ). We argue that τ is indispensable, without which the model may assign high importance to illogical multi-hop relations (e.g., [AtLocation, CapableOf]) or noisy relations (e.g., [RelatedTo, RelatedTo]).
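The factorized scoring of Eq. 11 can be illustrated as follows; in the real model δ and τ would be produced by MLPs and a learned transition matrix conditioned on s, whereas here they are plain arrays (our simplification):

```python
import numpy as np

def path_scores(rel_seqs, delta, tau):
    """Sketch of the beta(.) factor in the structured attention: score a
    k-hop relation sequence by summing unary potentials delta[r] and
    transition potentials tau[r_prev, r], then softmax-normalize over the
    candidate sequences. delta: (m,); tau: (m, m)."""
    logits = []
    for rels in rel_seqs:
        s = sum(delta[r] for r in rels)
        s += sum(tau[a, b] for a, b in zip(rels, rels[1:]))
        logits.append(s)
    logits = np.array(logits)
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()
```

Setting a strongly negative transition score for an implausible relation pair (the role of τ) drives the attention on any path containing that pair toward zero.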

Computation Complexity Analysis
Although the multi-hop message passing process in Eq. 7 and the structured relational attention module in Eq. 11 handle a potentially exponential number of paths, both can be computed in time linear in K via dynamic programming, by factoring the sums over paths into per-hop operations.

Table 2: Computation complexity of different models, for a dense graph and for a sparse graph with maximum node degree ∆ ≪ n. (The time and space entries are not recoverable from the extracted text.)
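The dynamic-programming idea can be sketched as follows (our own illustration): rather than enumerating paths, messages are propagated one hop at a time through relation-specific adjacency matrices, so the cost is O(K · m) matrix products instead of exponential in K:

```python
import numpy as np

def dp_multi_hop(x, A, W, K):
    """Dynamic-programming form of multi-hop message passing with uniform
    attention. A: (m, n, n) adjacency with A[r, i, j] = 1 iff (j, r, i) in E;
    W: (K, m, d, d). Returns z[k-1][i] = unnormalized sum over all k-hop
    paths (j, r_1, ..., r_k, i) of W^k_{r_k} ... W^1_{r_1} x_j."""
    z = []
    cur = x                                  # messages after 0 hops
    for t in range(K):
        # one hop: transform with W^{t+1}_r, then route along relation r
        cur = sum(A[r] @ (cur @ W[t, r].T) for r in range(A.shape[0]))
        z.append(cur)
    return z
```

Each iteration reuses the messages accumulated after t hops, which is exactly why the exponential sum over paths collapses to a linear recurrence.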

Learning, Inference and Path Decoding
We now discuss the learning and inference process of MHGRN, instantiated for the task of multiple-choice question answering. Following the problem formulation in Sec. 2, we aim to determine the plausibility of an answer candidate a ∈ C given the question q, with the information from both the text s and the graph G. We first obtain the graph representation g by performing attentive pooling over the output node embeddings of answer entities {h'_i ∣ i ∈ A}. Next we simply concatenate it with the text representation s and compute the plausibility score as ρ(q, a) = MLP(s ⊕ g).
During training, we maximize the plausibility score of the correct answer â by minimizing the cross-entropy loss:

    L = E_{q,â} [ −log ( exp(ρ(q, â)) / ∑_{a∈C} exp(ρ(q, a)) ) ].    (12)

The whole model is trained end-to-end, jointly with the text encoder (e.g., RoBERTa). During inference, we predict the most plausible answer by argmax_{a∈C} ρ(q, a). Additionally, we can decode a reasoning path as evidence for model predictions, endowing our model with the interpretability enjoyed by path-based models. Specifically, we first determine the answer entity i* with the highest score in the pooling layer and the path length k* with the highest score in Eq. 8. The reasoning path is then decoded by argmax α(j, r_1, ..., r_{k*}, i*), which can be computed in linear time using dynamic programming.
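The scoring and training objective can be sketched as follows (our own minimal version of ρ and Eq. 12, omitting the text and graph encoders):

```python
import numpy as np

def choice_cross_entropy(scores, correct):
    """Cross-entropy over answer choices (cf. Eq. 12): softmax the
    plausibility scores rho(q, a) over the choice set C and take the
    negative log-probability of the correct answer a-hat.
    scores: (|C|,) array; correct: index of the gold answer."""
    z = scores - scores.max()               # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[correct]

def predict(scores):
    """Inference: argmax over answer choices of rho(q, a)."""
    return int(np.argmax(scores))
```

Since the softmax is monotone in the scores, the argmax prediction coincides with the choice that minimizes the loss, so training and inference are consistent.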

Experimental Setup
We introduce how we construct G (§5.1), the datasets (§5.2), and the baseline methods (§5.3). Appendix A provides more implementation and experimental details for reproducibility.

Extracting G. We recognize the entity mentions in the statement s and ground them in ConceptNet, with which we initialize our node set V. We then add to V all the entities that appear on any two-hop path between pairs of mentioned entities. Unlike KagNet, we do not perform any pruning; instead we preserve all the edges between nodes in V, forming our G.

Datasets
We evaluate models on two multiple-choice question answering datasets, CommonsenseQA and OpenbookQA. Both require world knowledge beyond textual understanding to perform well.
CommonsenseQA (Talmor et al., 2019) necessitates various commonsense reasoning skills. The questions are created with entities from ConceptNet and they are designed to probe latent compositional relations between entities in ConceptNet.
OpenBookQA (Mihaylov et al., 2018) provides elementary science questions together with an open book of science facts. This dataset also probes general commonsense knowledge beyond the provided facts. As our model is orthogonal to text-form knowledge retrieval, we do not utilize the provided open book and instead use ConceptNet. Consequently, we do not compare our methods with those that use the open book.

Compared Methods
We implement both knowledge-agnostic fine-tuning of pre-trained LMs and models that incorporate a KG as an external source as our baselines. Additionally, we directly compare our model with results from the corresponding leaderboards; these methods typically leverage textual knowledge or extra training data, as opposed to an external KG. In all our implemented models, we use pre-trained LMs as text encoders for s for fair comparison. We stick to our focus of encoding structured KGs and therefore do not compare our models with those (Pan et al., 2019; Zhang et al., 2018; Banerjee et al., 2019) augmented by other text-form external knowledge (e.g., Wikipedia).

Results and Discussions
In this section, we present the results of our models in comparison with baselines as well as methods on the leaderboards for both CommonsenseQA and OpenbookQA. We also provide analysis of models' components and characteristics.

Main Results
For CommonsenseQA, knowledge-aware models consistently outperform knowledge-agnostic fine-tuned LMs, demonstrating the value of external knowledge on this dataset. Additionally, we evaluate our MHGRN (with ROBERTA-LARGE as the text encoder) on the official split (Table 4) for fair comparison with other methods on the leaderboard, in both the single-model setting and the ensemble-model setting.
In both cases, we achieve state-of-the-art performances across all existing models.
For OpenbookQA (Table 5), we use the official split and build models with ROBERTA-LARGE as the text encoder. Overall, external knowledge still brings benefit to this task. Our model surpasses all baselines, with an absolute improvement of ∼2% on the test set. As we seek to evaluate models in terms of encoding external knowledge graphs, we do not compare with models based on document retrieval and reading comprehension. Note that other submissions on the OpenbookQA leaderboard were usually fine-tuned on external question answering datasets or drew on large corpora via information retrieval; as we seek to systematically examine knowledge-aware reasoning methods that reason over interpretable structures, such models are beyond our scope. Other fine-tuned PTLMs can be simply incorporated whenever they become available.

Performance Analysis
Ablation Study on Model Components. We assess the impact of our model's components, as shown in Table 6. Disabling the type-specific transformation results in a ∼1.3% drop in performance, demonstrating the need to distinguish node types for QA tasks. Our structured relational attention mechanism is also critical, with its two sub-components contributing almost equally.
Impact of the Amount of Training Data. We use different fractions of the CommonsenseQA training data and report the results of fine-tuning text encoders alone versus jointly training the text encoder and graph encoder in Fig. 5. Regardless of the training data fraction, our model shows consistently larger improvements over knowledge-agnostic fine-tuning than the other graph encoding methods, indicating MHGRN's complementary strengths to text encoders.

Impact of the Number of Hops (K). We investigate the impact of the hyperparameter K on MHGRN's performance on CommonsenseQA (shown in Fig. 6). Increasing K brings benefit up to K = 3, while performance begins to drop when K > 3. This might be attributed to the exponential growth of noisy longer relational paths in the knowledge graph.

Model Interpretability
We can analyze our model's reasoning process by decoding the reasoning path using the method described in §4.5. Fig. 8 shows two examples from CommonsenseQA where our model correctly answers the questions and provides reasonable path evidence. In the example on the left, the model links question entities and the answer entity in a chain to support reasoning, while the example on the right shows a case where our model leverages unmentioned entities to bridge the reasoning gap between the question entity and answer entities, in a way that is coherent with the implied relation between CHAPEL and the desired answer in the question.
Figure 8: Case study on model interpretability. We present two sampled questions from CommonsenseQA (e.g., "Where is known for a multitude of wedding chapels?") with the reasoning paths output by MHGRN.

Potential Compatibility with Other Methods
In theory, our approach is naturally compatible with methods that utilize textual knowledge or extra data (such as the leaderboard methods in Table 4), because in our paradigm the encoding of the textual statement and of the graph are structurally decoupled (Fig. 3). We could take, for example, the fine-tuned RoBERTa+KE 6 system as our text encoder and leave the rest of our model architecture unchanged.

Related Work
Knowledge-Aware Methods for NLP. Various works have investigated the potential of empowering NLP models with external knowledge. Many attempt to extract structured knowledge, either in the form of nodes (Yang and Mitchell, 2017), triples (Weissenborn et al., 2017; Mihaylov and Frank, 2018), paths (Bauer et al., 2018; Kundu et al., 2019; Lin et al., 2019), or subgraphs (Li and Clark, 2015), and encode them to augment textual understanding. The recent success of pre-trained LMs motivates many (Pan et al., 2019; Ye et al., 2019; Zhang et al., 2018; Banerjee et al., 2019) to probe LMs' potential as latent knowledge bases. This line of work turns to textual knowledge (e.g., Wikipedia) to directly impart knowledge to pre-trained LMs. They generally fall into two paradigms: 1) fine-tuning LMs on large-scale general-domain datasets (e.g., RACE (Lai et al., 2017)) or on knowledge-rich text; 2) providing LMs with evidence via information retrieval techniques. However, these models cannot provide explicit reasoning and evidence, and are thus hardly trustworthy. They are also subject to the availability of in-domain datasets and the maximum input length of pre-trained LMs.
Neural Graph Encoding. Graph Attention Networks (GAT) (Velickovic et al., 2018) incorporate an attention mechanism in feature aggregation, and RGCN (Schlichtkrull et al., 2018) proposes relational message passing, which makes it applicable to multi-relational graphs. However, they only perform single-hop message passing and cannot be interpreted at the path level. Other work (Abu-El-Haija et al., 2019; Nikolentzos et al., 2019) aggregates for a node its K-hop neighbors based on node-wise distances, but is designed for non-relational graphs. MHGRN addresses these issues by reasoning over multi-relational graphs and by remaining interpretable via maintaining paths as reasoning chains.

6 https://github.com/jose77/csqa/

Conclusion
We present a principled, scalable method, MHGRN, that can leverage general knowledge via multi-hop reasoning over interpretable structures (e.g., ConceptNet). The proposed MHGRN generalizes and combines the advantages of GNNs and path-based reasoning models. It explicitly performs multi-hop relational reasoning and is empirically shown to outperform existing methods with superior scalability and interpretability. Our extensive experiments systematically compare MHGRN with other knowledge-aware methods. In particular, we achieve state-of-the-art performance on the CommonsenseQA dataset.

A Implementation Details
Our models are implemented in PyTorch. We use cross-entropy loss and the RAdam (Liu et al., 2019a) optimizer. We find it beneficial to use separate learning rates for the text encoder and the graph encoder, and we tune both on the two datasets. We first fine-tune ROBERTA-LARGE, BERT-LARGE, and BERT-BASE on CommonsenseQA and ROBERTA-LARGE on OpenbookQA respectively, and choose a dataset-specific learning rate from {1 × 10^−5, 2 × 10^−5, 3 × 10^−5, 6 × 10^−5, 1 × 10^−4} for each text encoder, based on the best performance on the development set, as listed in Table 7. We report the performance of these fine-tuned text encoders and also adopt their dataset-specific optimal learning rates in joint training with graph encoders.

For models that involve a KG, the learning rate of the graph encoder is chosen from {1 × 10^−4, 3 × 10^−4, 1 × 10^−3, 3 × 10^−3}, based on the best development set performance with ROBERTA-LARGE as the text encoder. We report the optimal learning rates for graph encoders in Table 8. In training, we set the maximum input sequence length for text encoders to 64 and the batch size to 32, and perform early stopping.

For the input node features, we first use templates to turn knowledge triples in ConceptNet into sentences and feed them into pre-trained BERT-LARGE, obtaining a sequence of token embeddings from the last layer of BERT-LARGE for each triple. For each entity, we perform mean pooling over the tokens of the entity's occurrences across all the sentences to form a 1024-dimensional vector as its corresponding node feature. We use this set of features for all our implemented models.
We use a 2-layer RGCN and a single-layer MHGRN across our experiments.