Message Passing for Hyper-Relational Knowledge Graphs

Hyper-relational knowledge graphs (KGs) (e.g., Wikidata) enable associating additional key-value pairs with the main triple to disambiguate a fact or restrict its validity. In this work, we propose StarE, a message passing based graph encoder capable of modeling such hyper-relational KGs. Unlike existing approaches, StarE can encode an arbitrary number of additional key-value pairs (qualifiers) along with the main triple while keeping the semantic roles of qualifiers and triples intact. We also demonstrate that existing benchmarks for evaluating link prediction (LP) performance on hyper-relational KGs suffer from fundamental flaws and thus develop a new Wikidata-based dataset, WD50K. Our experiments demonstrate that a StarE-based LP model outperforms existing approaches across multiple benchmarks. We also confirm that leveraging qualifiers is vital for link prediction, with gains of up to 25 MRR points compared to triple-based representations.


Introduction
The task of link prediction over knowledge graphs (KGs) has seen a wide variety of advances over the years (Ji et al., 2020). The objective of this task is to predict new links between entities in the graph based on the existing ones. A majority of these approaches are designed to work over triple-based KGs, where facts are represented as binary relations between entities. This data model, however, doesn't allow for an intuitive representation of facts with additional information. For instance, in Fig. 1.A, it is non-trivial to add information which can help disambiguate whether the two universities attended by Albert Einstein awarded him the same degree.
This additional information can be provided in the form of key-value restrictions over instances of binary relations between entities in recent knowledge graph models (Vrandecic and Krötzsch, 2014; Pellissier-Tanon et al., 2020; Ismayilov et al., 2018). Such restrictions are known as qualifiers in the Wikidata statement model (Vrandecic and Krötzsch, 2014) or triple metadata in RDF* (Hartig, 2017) and RDF reification approaches (Frey et al., 2019). These complex facts with qualifiers can be represented as hyper-relational facts (see Sec. 3). In our example (Fig. 1.B), hyper-relational facts allow us to observe that Albert Einstein obtained different degrees at those universities.
Existing representation learning approaches for such graphs either treat a hyper-relational fact as an n-ary (n > 2) composed relation (e.g., educatedAt academicDegree) (Zhang et al., 2018; Liu et al., 2020), losing entity-relation attribution; ignore the semantic difference between a triple relation (educatedAt) and a qualifier relation (academicDegree) (Guan et al., 2019); or decompose a hyper-relational instance into multiple quintuples, each comprising a triple and one qualifier key-value pair (Rosso et al., 2020). In this work, we propose an alternate graph representation learning mechanism capable of encoding hyper-relational KGs with an arbitrary number of qualifiers, while keeping the semantic roles of qualifiers and triples intact.
To accomplish this, we leverage the advances in Graph Neural Networks (GNNs), many of which are instances of the message passing framework (Gilmer et al., 2017), to learn latent representations of nodes and edges of a given graph. Recently, GNNs have been demonstrated (Vashishth et al., 2020) to be capable of encoding multi-relational (triple-based) knowledge graphs. Inspired by them, we further extend this framework to incorporate hyper-relational KGs, and propose STARE, which to the best of our knowledge is the first GNN-based approach capable of doing so (see Sec. 4).
Furthermore, we show that WikiPeople (Guan et al., 2019) and JF17K (Wen et al., 2016), two commonly used benchmarking datasets for LP over hyper-relational KGs, exhibit design flaws that render them ineffective benchmarks for the hyper-relational link prediction task (see Sec. 5). JF17K suffers from significant test leakage, while most of the qualifier values in WikiPeople are literals, which are conventionally ignored in KG embedding approaches, rendering the dataset largely devoid of qualifiers. Instead, we propose a new hyper-relational link prediction dataset, WD50K, extracted from Wikidata (Vrandecic and Krötzsch, 2014), that contains statements with varying numbers of qualifiers, and use it to benchmark our approach.
Through our experiments (Sec. 6), we find that the STARE-based model generally outperforms other approaches on the task of link prediction (LP) over hyper-relational knowledge graphs. We provide further evidence, independent of STARE, that triples enriched with qualifier pairs provide additional signal beneficial for the LP task.

Related Work
Early approaches for modelling hyper-relational graphs stem from conventional triple-based KG embedding algorithms, which often simplify complex property attributes (qualifiers). For instance, m-TransH (Wen et al., 2016) requires star-to-clique conversion, which results in a permanent loss of entity-relation attribution. Later models, e.g., RAE (Zhang et al., 2018), and HypE and HSimple introduced in (Fatemi et al., 2020), converted hyper-relational facts into n-ary facts with one abstract relation which is supposed to loosely represent a combination of all relations of the original fact.
Recently, GETD (Liu et al., 2020) extended the TuckER (Balazevic et al., 2019) tensor factorization approach to n-ary relational facts. However, the model still expects only one relation in a fact and is not able to process facts of different arity in one dataset, e.g., 3-ary and 4-ary facts have to be split and trained separately.
NaLP (Guan et al., 2019) is a convolutional model that supports multiple entities and relations in one fact. However, every complex fact with k qualifiers has to be broken down into k + 2 key-value pairs with an artificial split of the main (s, p, o) triple into (p_s : s) and (p_o : o) pairs. Consequently, all key-value pairs are treated equally, and thus the model does not distinguish between the main triple and relation-specific qualifiers.
HINGE (Rosso et al., 2020) also adopts a convolutional framework for modeling hyper-relational facts. A main triple is iteratively convolved with every qualifier pair as a quintuple, followed by min pooling over quintuple representations. Although it retains the hyper-relational nature of facts, HINGE operates on a triple-quintuple level that lacks the granularity of representing a certain relation instance with its qualifiers. Additionally, HINGE has to be trained sequentially in a curriculum learning (Bengio et al., 2009) fashion, which requires sorting all facts in a KG in ascending order of the number of qualifiers per fact and might be prohibitively expensive for large-scale graphs.
Instead, our approach directly augments a relation representation with any number of attached qualifiers, properly separating auxiliary entities and relations from those in the main triple. Additionally, we do not impose any restrictions on the input order of facts or on the number of qualifiers per fact.
Parallel to our approach are the methods that work over hypergraphs, e.g., DHNE (Tu et al., 2018), Hyper-SAGNN (Zhang et al., 2020), and knowledge hypergraphs like HypE (Fatemi et al., 2020). We consider hyper-relational graphs and hypergraphs to be conceptually different. As hyperedges contain multiple nodes, such hyperedges are closer to n-ary relations r(e_1, ..., e_n) with one abstract relation. The attribution of entities to the main triple or qualifiers is lost, and qualifying relations are not defined. Combining a certain set of main and qualifying relations into one abstract r_k() would lead to a combinatorial explosion of typed hyperedges since, in principle, any relation could be used in a qualifier, and the number of qualifiers per fact is not limited. Therefore, modeling qualifiers in hypergraphs becomes non-trivial, and we leave such a study for future work.

Preliminaries
GNNs on Undirected Graphs: Consider an undirected graph G = (V, E), where V represents the set of nodes and E denotes the set of edges. Each node v ∈ V has an associated vector h_v and neighbourhood N(v). In the message passing framework (Gilmer et al., 2017), the node representations are learned iteratively by aggregating representations (messages) from their neighbours:

$$h_v^{(k+1)} = \mathrm{UPD}\left(h_v^{(k)},\ \mathrm{AGGR}_{u \in N(v)}\ \phi\left(h_v^{(k)}, h_u^{(k)}, e_{vu}\right)\right) \qquad (1)$$

where AGGR(·) and UPD(·) are differentiable functions for neighbourhood aggregation and node update, respectively; φ(·) is the message function; h_v^{(k)} is the representation of a node v at layer k; e_{vu} is the representation of an edge between nodes v and u.
Different GNN architectures implement their own aggregation and update strategy. For example, in the case of Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017) the representations of neighbours are first transformed via a weight matrix W and then combined and passed through a non-linearity f(·) such as ReLU. A GCN layer k can be represented as:

$$h_v^{(k+1)} = f\left(\sum_{u \in N(v)} W^{(k)} h_u^{(k)}\right) \qquad (2)$$

GCN and other seminal architectures such as GAT (Velickovic et al., 2018) and GIN (Xu et al., 2019) do not model relation embeddings explicitly and require further modifications to support multi-relational KGs.
GNN on Directed Multi-Relational Graphs: Consider a multi-relational graph G = (V, R, E), where R represents the set of relations r, and E denotes the set of directed edges (s, r, o) in which nodes s ∈ V and o ∈ V are connected via relation r. The GCN formulation by Marcheggiani and Titov (2017) assumes that the information in a directed edge flows in both directions. Thus, for each edge (s, r, o), an inverse edge (o, r^{-1}, s) is added to E. Further, self-looping relations (v, r_self, v) for each node v ∈ V are added to E, enabling an update of a node state based on its previous one, and further improving normalization.
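The edge augmentation described above can be illustrated with a minimal Python sketch. The function name `augment_edges`, the `_inv` suffix for inverse relations, and the `self` relation label are naming assumptions made for this illustration only.

```python
def augment_edges(triples, entities):
    """Return the edge list extended with inverse and self-loop edges.

    triples:  list of (s, r, o) tuples
    entities: iterable of node identifiers
    The "_inv" suffix and the "self" label are hypothetical names.
    """
    augmented = list(triples)
    # One inverse edge (o, r^-1, s) for every original edge (s, r, o).
    augmented += [(o, f"{r}_inv", s) for (s, r, o) in triples]
    # One self-loop (v, r_self, v) for every node v.
    augmented += [(v, "self", v) for v in entities]
    return augmented


# Toy data for illustration.
edges = [("Albert Einstein", "educated at", "University of Zurich")]
nodes = ["Albert Einstein", "University of Zurich"]
augmented_edges = augment_edges(edges, nodes)
```

With one original edge and two nodes, the augmented list holds four edges: the original, its inverse, and two self-loops.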
For directed multi-relational graphs, Equation 2 can be extended by introducing relation-specific weights W_r (Marcheggiani and Titov, 2017; Schlichtkrull et al., 2018):

$$h_v^{(k+1)} = f\left(\sum_{(u,r) \in N(v)} W_r^{(k)} h_u^{(k)}\right) \qquad (3)$$

However, such networks are known to be overparameterized. Instead, CompGCN (Vashishth et al., 2020) proposes to learn specific edge type vectors:

$$h_v^{(k+1)} = f\left(\sum_{(u,r) \in N(v)} W_{\lambda(r)}^{(k)}\, \phi\left(h_u^{(k)}, h_r^{(k)}\right)\right) \qquad (4)$$

where φ(·) is a composition function of a node u with its respective relation r, and W_{λ(r)} is a direction-specific shared parameter for incoming, outgoing, and self-looping relations. The composition φ : R^d × R^d → R^d can be any entity-relation function akin to TransE (Bordes et al., 2013) or DistMult (Yang et al., 2015).
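As a rough illustration of a composition-based node update, the following plain-Python sketch applies a direction-specific weight matrix to φ(h_u, h_r) for every neighbour, sums the messages, and applies ReLU. Subtraction and element-wise multiplication stand in for TransE-like and DistMult-like compositions. All function names, the direction labels `in` and `out`, and the toy weights are assumptions of this sketch, not part of any reference implementation.

```python
def compose_sub(h_u, h_r):
    """TransE-like composition: element-wise subtraction."""
    return [a - b for a, b in zip(h_u, h_r)]


def compose_mult(h_u, h_r):
    """DistMult-like composition: element-wise multiplication."""
    return [a * b for a, b in zip(h_u, h_r)]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def update_node(neighbours, W_by_direction, phi):
    """Sum W_direction * phi(h_u, h_r) over all neighbours, then apply ReLU.

    neighbours:     list of (h_u, h_r, direction) tuples
    W_by_direction: dict mapping a direction label to a weight matrix
    """
    out_dim = len(next(iter(W_by_direction.values())))
    total = [0.0] * out_dim
    for h_u, h_r, direction in neighbours:
        message = matvec(W_by_direction[direction], phi(h_u, h_r))
        total = [t + m for t, m in zip(total, message)]
    return [max(0.0, t) for t in total]


# Toy example: identity weights for incoming edges, doubled for outgoing ones.
W = {"in": [[1.0, 0.0], [0.0, 1.0]], "out": [[2.0, 0.0], [0.0, 2.0]]}
neighbours = [([1.0, 2.0], [0.5, 0.5], "in"), ([1.0, 1.0], [2.0, 3.0], "out")]
h_v_mult = update_node(neighbours, W, compose_mult)
h_v_sub = update_node(neighbours, W, compose_sub)
```

Swapping `phi` changes only the composition step; the aggregation and the direction-specific projection stay the same.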
Hyper-Relational Graphs: In the case of a hyper-relational graph G = (V, R, E), E is a list (e_1, ..., e_n) of edges with e_j ∈ V × R × V × P(R × V) for 1 ≤ j ≤ n, where P denotes the power set. A hyper-relational fact e_j ∈ E is usually written as a tuple (s, r, o, Q), where Q is the set of qualifier pairs {(qr_i, qv_i)} with qualifier relations qr_i ∈ R and qualifier values qv_i ∈ V. (s, r, o) is referred to as the main triple of the fact. We use the notation Q_j to denote the qualifier pairs of e_j. For example, under this representation scheme, one of the edges in Fig. 1.B would be (Albert Einstein, educated at, University of Zurich, (academic degree, Doctorate), (academic major, Physics)).

Figure 2: The mechanism in which STARE encodes a hyper-relational fact from Fig. 1.B. Qualifier pairs are passed through a composition function φ_q, summed, and transformed by W_q. The resulting vector is then merged via γ and φ_r with the relation and object vector, respectively. Finally, node Q937 aggregates messages from this and other hyper-relational edges.
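The tuple notation (s, r, o, Q) maps naturally onto a small immutable record. The sketch below is one possible in-memory representation; the class name `Statement` and its field names are assumptions chosen for readability.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class Statement:
    """A hyper-relational fact (s, r, o, Q); names are illustrative."""
    s: str
    r: str
    o: str
    # Q: an unordered set of (qualifier relation, qualifier value) pairs.
    qualifiers: FrozenSet[Tuple[str, str]] = frozenset()

    @property
    def main_triple(self) -> Tuple[str, str, str]:
        return (self.s, self.r, self.o)


fact = Statement(
    s="Albert Einstein",
    r="educated at",
    o="University of Zurich",
    qualifiers=frozenset({
        ("academic degree", "Doctorate"),
        ("academic major", "Physics"),
    }),
)
```

Using a frozen set for Q reflects that qualifier pairs carry no inherent order, and a triple-only fact is simply a statement with an empty qualifier set.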

STARE
In this section, we introduce our main contribution, STARE, and show how we use it for link prediction (LP). STARE (cf. Fig. 2 for the intuition) incorporates statement qualifiers {(qr_i, qv_i)}, along with the main triple (s, r, o), into a message passing process. To do this, we extend Equation 4 by combining the edge-type embedding h_r with a fixed-length vector h_q representing qualifiers associated with a particular relation r between nodes u and v. The resultant equation is thus:

$$h_v = f\left(\sum_{(u,r) \in N(v)} W_{\lambda(r)}\, \phi_r\left(h_u, \gamma(h_r, h_q)_{vu}\right)\right) \qquad (5)$$

where γ(·) is a function that combines the main relation representation with the representation of its qualifiers, e.g., concatenation [h_r, h_q], element-wise multiplication h_r ⊙ h_q, or weighted sum:

$$\gamma(h_r, h_q) = \alpha \odot h_r + (1 - \alpha) \odot h_q \qquad (6)$$

where α is a hyperparameter that controls the flow of information from qualifier vector h_q to h_r.
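The three candidate choices for γ(·) named above can be written as short plain-Python functions. This is a minimal sketch over lists of floats; the function names are assumptions, and the note on concatenation reflects a general property of concatenation rather than a documented design detail.

```python
def gamma_concat(h_r, h_q):
    """Concatenation [h_r, h_q]; the output has twice the input dimension,
    so a later projection back to d dimensions would be needed (assumption)."""
    return list(h_r) + list(h_q)


def gamma_mult(h_r, h_q):
    """Element-wise multiplication of relation and qualifier vectors."""
    return [a * b for a, b in zip(h_r, h_q)]


def gamma_weighted_sum(h_r, h_q, alpha=0.8):
    """Weighted sum alpha * h_r + (1 - alpha) * h_q."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(h_r, h_q)]


h_r = [1.0, 2.0]
h_q = [3.0, 4.0]
merged = gamma_weighted_sum(h_r, h_q, alpha=0.8)
```

With α = 1 the weighted sum ignores qualifiers entirely and returns h_r; smaller values of α let more qualifier information flow into the relation representation.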
Finally, the qualifier vector h_q is obtained through a composition φ_q of a qualifier relation h_qr and qualifier entity h_qv. The composition function φ_q may be any entity-relation function akin to φ (Equation 4). The representations of different qualifier pairs are then aggregated via a position-invariant summation function and passed through a parameterized projection W_q:

$$h_q = W_q \sum_{(qr, qv) \in Q_j} \phi_q\left(h_{qr}, h_{qv}\right) \qquad (7)$$

This formalisation (i) allows incorporating an arbitrary number of qualifier pairs and (ii) can take into account whether entities/relations occur in the main triple or the qualifier pairs. STARE is the first GNN model for representation learning of hyper-relational KGs that has these characteristics.

Figure 3: Architecture of a STARE-based link prediction model. STARE updates the V, R embedding matrices, which are then used to encode the relations in a given query before passing them through the Transformer, Pooling, and fully connected layers. The fixed-dimensional output is then compared to V, the result of which is passed through a sigmoid function to yield a probability distribution over entities.
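The compose, sum, and project steps for qualifiers can be sketched in plain Python as follows. Element-wise multiplication is used for φ_q purely as an example; the function names and the toy matrix W_q are assumptions of this sketch.

```python
def phi_q_mult(h_qr, h_qv):
    """Example composition of a qualifier relation with its value."""
    return [a * b for a, b in zip(h_qr, h_qv)]


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def qualifier_vector(pairs, W_q, phi_q=phi_q_mult):
    """Compose each (h_qr, h_qv) pair, sum the results, project with W_q."""
    in_dim = len(W_q[0])
    summed = [0.0] * in_dim
    for h_qr, h_qv in pairs:
        composed = phi_q(h_qr, h_qv)
        summed = [s + c for s, c in zip(summed, composed)]
    return matvec(W_q, summed)


W_q = [[1.0, 1.0], [0.0, 2.0]]
pairs = [([1.0, 2.0], [3.0, 1.0]), ([2.0, 0.0], [1.0, 5.0])]
h_q = qualifier_vector(pairs, W_q)
h_q_reversed = qualifier_vector(list(reversed(pairs)), W_q)
h_q_empty = qualifier_vector([], W_q)
```

Because summation is order-independent, reversing the qualifier pairs yields the same vector, and a fact without qualifiers maps to the zero vector, so the output length is fixed regardless of how many pairs a fact has.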
STARE for Link Prediction. STARE is a general representation learning framework for capturing the structure of hyper-relational graphs, and thus can be applied to multiple downstream tasks. In this work, we focus on LP and leave other tasks such as node classification for future work. In LP, given a query (s, r, Q), the task is to predict an entity corresponding to the object position o.
Our link prediction model (see Fig. 3) is composed of two parts, namely (i) a STARE-based encoder and (ii) a Transformer-based decoder. In every iteration, STARE updates the embeddings (R, V) by message passing across every edge in the training set. In the decoding step, we first linearize the given query, and use the updated embeddings (R, V) to encode the entities and relations within it. Then, this linearized sequence is passed through the Transformer block, whose output is averaged to get a fixed-dimensional vector representation of the query. The vector is then passed through a fully-connected layer, multiplied with V, and then passed through a sigmoid to obtain a probability distribution over all entities. Thereafter, it is trivial to retrieve the top n candidate entities for the o position in the query. Note that we can use different decoders in this architecture. An explanation and evaluation of a few decoders is provided in Appendix D.
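The query linearization step can be sketched as follows. The padding token `[PAD]`, the truncation to a maximum number of pairs, and the function name are assumptions of this sketch; the paper only states that queries are linearized before being passed to the Transformer.

```python
PAD = "[PAD]"  # hypothetical padding token


def linearize_query(s, r, qualifiers, max_pairs=6, pad=PAD):
    """Flatten a query (s, r, Q) into a fixed-length token sequence.

    The sequence is [s, r, qr_1, qv_1, ..., qr_n, qv_n], truncated to
    max_pairs qualifier pairs and padded to length 2 + 2 * max_pairs.
    """
    tokens = [s, r]
    for qr, qv in list(qualifiers)[:max_pairs]:
        tokens += [qr, qv]
    target_len = 2 + 2 * max_pairs
    tokens += [pad] * (target_len - len(tokens))
    return tokens


query = linearize_query(
    "Albert Einstein",
    "educated at",
    [("academic degree", "Doctorate"), ("academic major", "Physics")],
)
```

With up to six qualifier pairs the sequence length is 14, which matches the query length quoted for the decoders in Appendix D.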

WD50K Dataset
Recent approaches (Guan et al., 2019; Liu et al., 2020; Rosso et al., 2020) for embedding hyper-relational KGs often use WikiPeople and JF17K as benchmarking datasets. We argue that those datasets cannot fully capture the task complexity.
In WikiPeople, about 13% of statements contain at least one literal. Literals (e.g., numeric values, date-time instances, or other strings) in KGs are conventionally ignored (Rosso et al., 2020) by embedding approaches, or are incorporated through specific means (Kristiadi et al., 2019). However, after removing statements with literals, less than 3% of the remaining statements contain any qualifier pairs. Out of those, about 80% possess only one qualifier. This fact renders WikiPeople less sensitive to hyper-relational models, as performance on triple-only facts dominates the overall score.
The authors of JF17K reported that the dataset contains redundant entries. In our own analysis, we detected that about 44.5% of the test statements share the same main (s, r, o) triple as the train statements. We consider this fact a major data leakage which allows triple-based models to memorize subjects and objects appearing in the test set.
To alleviate the above problems, we propose a new dataset, WD50K, extracted from Wikidata statements. The following steps are used to sample our dataset from the Wikidata RDF dump of August 2019. We begin with a set of seed nodes corresponding to entities from FB15K-237 having a direct mapping in Wikidata (P646 "Freebase ID"). Then, for each seed node, all statements whose main object and qualifier values correspond to wikibase:Item are extracted. This step results in the removal of all literals in object position. Similarly, all literals are filtered out from the qualifiers of the obtained statements. To increase the connectivity in the statements graph, all the entities mentioned fewer than twice are dropped.
All the statements of WD50K are randomly split into the train, test, and validation sets. To eliminate test set leakage, we remove all statements from the train and validation sets that share the same main triple (s, p, o) with test statements. Finally, we remove statements from the test set that contain entities and relations not present in the train or validation sets. WD50K contains 236,507 statements describing 47,156 entities with 532 relations, where about 14% of statements have at least one qualifier pair. See Table 1 and Appendix A for further details. The dataset is publicly available.
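The two cleaning steps applied to the splits can be sketched in plain Python. Statements are modelled here as `(s, r, o, qualifiers)` tuples with `qualifiers` a tuple of `(qr, qv)` pairs; this layout, the function name, and the toy data are assumptions made for illustration.

```python
def remove_leakage(train, valid, test):
    """Apply the two cleaning steps described above to (s, r, o, quals) tuples."""
    # Step 1: drop train/valid statements whose main triple occurs in test.
    test_triples = {st[:3] for st in test}
    train = [st for st in train if st[:3] not in test_triples]
    valid = [st for st in valid if st[:3] not in test_triples]

    # Step 2: drop test statements with entities or relations that are
    # absent from the remaining train and validation statements.
    entities, relations = set(), set()
    for s, r, o, quals in train + valid:
        entities.update((s, o))
        relations.add(r)
        for qr, qv in quals:
            relations.add(qr)
            entities.add(qv)

    def is_known(statement):
        s, r, o, quals = statement
        return (s in entities and o in entities and r in relations
                and all(qr in relations and qv in entities for qr, qv in quals))

    test = [st for st in test if is_known(st)]
    return train, valid, test


train = [("a", "r1", "b", ()), ("a", "r1", "c", (("q1", "d"),)), ("c", "r2", "a", ())]
valid = [("b", "r2", "c", ())]
test = [("a", "r1", "b", (("q1", "d"),)), ("c", "r1", "z", ()), ("b", "r1", "c", ())]
clean_train, clean_valid, clean_test = remove_leakage(train, valid, test)
```

In the toy data, the first train statement is removed because its main triple appears in the test set, and the test statement mentioning the unseen entity `z` is dropped.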

Experiments
In this section, we discuss the setup and results of multiple experiments conducted towards (i) assessing the performance of our proposed approach on the link prediction task, and (ii) analyzing the effects of including hyper-relational information during link prediction.

Evaluating STARE on the LP Task
In this experiment, we evaluate our proposed approach on the task of LP over hyper-relational graphs. We designed it both to compare STARE with the state-of-the-art algorithms and to better understand the contribution of the STARE encoder.
Datasets: We use WikiPeople (downloaded from https://github.com/gsp2014/NaLP/tree/master/data/WikiPeople) and JF17K (downloaded from https://www.dropbox.com/sh/ryxohj363ujqhvq/AAAoGzAElmNnhXrWEj16UiUga?dl=0), despite their design flaws (see Sec. 5), to illustrate the performance differences with existing approaches. We also provide a benchmark of our approach on the WD50K dataset introduced in this article. Note that, as described by Rosso et al. (2020), we drop all statements containing literals in WikiPeople. Further dataset statistics are presented in Table 1.
Baselines: In this experiment, we compare against previous hyper-relational approaches namely: (i) m-TransH (
To assess the significance of the STARE encoder, we also train a simpler model where the Transformer based decoder directly uses the randomly initialized embedding matrices without the STARE encoder. We call this model Transformer (H), and the one with the STARE encoder STARE (H) + Transformer (H). Here (H) represents that the input to the model is a hyper-relational fact. Later, we also experiment with triples as input and represent them with (T) (see Sec. 6.4).
Evaluation: For all the systems discussed above, we report various performance metrics when predicting the subject and object of hyper-relational facts. We adopt the filtered setting introduced in (Bordes et al., 2013) for computing mean reciprocal rank (MRR) and hits at 1, 5, and 10 (H@1, H@5, H@10). The metrics are computed for subject and object prediction separately and are then averaged.
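The filtered ranking metrics can be sketched as follows. In the filtered setting, candidates that are themselves known correct answers for the query are skipped when ranking the target. The function names, the dictionary-based score layout, and the choice to break ties in favour of the target are assumptions of this sketch.

```python
def filtered_rank(scores, target, known_answers=frozenset()):
    """Rank of `target` among candidates, ignoring other known correct answers.

    scores:        dict mapping each candidate entity to its score
    known_answers: other entities that also answer the query correctly
    Ties are resolved in favour of the target (an assumption).
    """
    target_score = scores[target]
    rank = 1
    for entity, score in scores.items():
        if entity == target or entity in known_answers:
            continue
        if score > target_score:
            rank += 1
    return rank


def mrr_and_hits(ranks, ks=(1, 5, 10)):
    """Mean reciprocal rank and hits@k over a list of ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
raw_rank = filtered_rank(scores, "c")
flt_rank = filtered_rank(scores, "c", known_answers={"a"})
mrr, hits = mrr_and_hits([1, 2, 4])
```

Averaging subject and object prediction then amounts to computing these metrics over the ranks of each direction separately and taking the mean of the two results.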
Training: We train the model in the 1-N setting using binary cross entropy loss with label smoothing as in (Dettmers et al., 2018; Vashishth et al., 2020) with the Adam (Kingma and Ba, 2015) optimizer for 500 epochs on WikiPeople and for 400 epochs on the JF17K and WD50K datasets. Hyperparameters were selected by manual fine-tuning, with further details in Appendix C. STARE is implemented with PyTorch Geometric (Fey and Lenssen, 2019) and is publicly available.
Results and Discussion: The results of this experiment can be found in Table 2. We observe that the STARE encoder based model outperforms the other hyper-relational models across WikiPeople and JF17K. On JF17K, STARE (H) + Transformer (H) reports a gain of 11.3 (25%) MRR points, 13 (33%) H@1, and 7.8 (12%) H@10 points when compared to the next-best approach. Recall that JF17K suffers from a major test set leakage (Sec. 5), which we investigate in greater detail in Exp. 4 (Sec. 6.4) below. On WikiPeople, HINGE has a higher H@1 score than STARE (H) + Transformer (H). However, its H@10 is lower than the H@5 of our approach, i.e., the top five predictions of the STARE model are more likely to contain a correct answer than the top 10 predictions of HINGE. We can thus claim our STARE-based model to be competitive with, if not outperforming, the state of the art on the task of link prediction over hyper-relational KGs, albeit on less-than-ideal baselines.
We further present the performance of our approach as a baseline on the WD50K dataset in Table 3. With an MRR score of 0.349, H@1 of 0.271, and H@10 of 0.496, we find that the task is far from solved; however, the STARE-based approaches provide effective, non-trivial baselines.
Note that Transformer (H) (without STARE) also performs competitively to HINGE. This suggests that the aforementioned gains in metrics of our approach cannot all be attributed to STARE's innate ability to effectively encode the hyper-relational information. That said, upon comparing the performance of STARE (H) + Transformer (H) and Transformer (H), we find that using STARE is consistently advantageous across all the datasets.

Impact of Ratio of Statements with and Without Qualifier Pairs
Based on the relatively high performance of Transformer (H) (without the encoder) in the previous experiment, we study the relationship between the amount of hyper-relational information (qualifiers) and the ability of STARE to incorporate it for the LP task. Here, we sample datasets from WD50K with a varying ratio of facts with qualifier pairs to the total number of facts in the KG. Specifically, we sample three datasets, namely WD50K (33), WD50K (66), and WD50K (100), containing approximately 33%, 66%, and 100% of such hyper-relational facts, respectively. We use the same experimental setup as the one discussed in the previous section. Table 3 presents the result of this experiment. We observe that across all metrics, STARE (H) + Transformer (H) performs increasingly better than Transformer (H) as the ratio of qualifier pairs increases in the dataset. Concretely, the difference in their H@1 scores is 4.1, 6.8, and 8.9 points on WD50K (33), WD50K (66), and WD50K (100), respectively. These and the Sec. 6.1 results confirm that (i) STARE is better suited to utilize the qualifier information available in the KG, (ii) which, when leveraged by a transformer decoder, outperforms other hyper-relational LP approaches, and (iii) that STARE's positive effect increases as the amount of qualifiers in the task increases.

Impact of Number of Qualifiers per Statement
In WD50K, as in Wikidata, the number of qualifiers corresponding to a statement varies significantly (see Appendix A). In this experiment, we intend to quantify its effect on the model performance.
To do so, we create multiple variants of WD50K, each containing statements with up to n qualifiers (n ∈ [1, 6]). In other words, for a given number n, we collect all the statements which have at most n qualifiers. If a statement contains more than n qualifiers, we arbitrarily choose n qualifiers amongst them. Thus, the total number of facts remains the same across these variants. Figure 4 presents the result of this experiment.
For all the datasets, we find that two qualifier pairs are enough for our model performance to saturate. This might be attributed to an underlying characteristic of the dataset or to the model's inability to aggregate information from longer statements. We leave further analysis of this for future work. However, we observe that in the case of WD50K and other datasets, STARE (H) + Transformer (H) slightly improves or remains stable as statement length increases, while Transformer (H) shows degradation in performance.
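The construction of such a variant can be sketched as below. Statements are modelled as `(s, r, o, qualifiers)` tuples; this layout, the function name, and the use of a seeded random sample for the arbitrary choice are assumptions of this sketch.

```python
import random


def cap_qualifiers(statements, n, seed=0):
    """Keep every statement, but retain at most n qualifier pairs per statement.

    Statements with more than n qualifiers keep an arbitrary subset of size n.
    """
    rng = random.Random(seed)
    capped = []
    for s, r, o, quals in statements:
        quals = list(quals)
        if len(quals) > n:
            quals = rng.sample(quals, n)
        capped.append((s, r, o, tuple(quals)))
    return capped


statements = [
    ("a", "r", "b", (("q1", "x"), ("q2", "y"), ("q3", "z"))),
    ("c", "r", "d", ()),
]
variant = cap_qualifiers(statements, n=2)
```

The number of statements is unchanged by construction; only the qualifier sets shrink.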

Comparison to Triple Baselines
To further understand the role of qualifier information in the LP task, we design an experiment to gauge the performance difference between models on hyper-relational KGs and triple-based KGs. Concretely, we create a new triple-only dataset by pruning all qualifier information from the statements in WikiPeople, JF17K, and WD50K. That is, two statements that describe the same main fact, (s, r, o, {(qr_1, qv_1), (qr_2, qv_2)}) and (s, r, o, {(qr_3, qv_3), (qr_4, qv_4)}), are reduced to one triple (s, r, o). Thus, the overall number of distinct entities and relations is reduced, but the number of subjects and objects in main triples for the LP task is the same. We introduce STARE (T) + Transformer (T), a model for this experiment. STARE (T) is similar to CompGCN (Vashishth et al., 2020), and can only model triple-based (s, r, o) facts. Since inputs to the Transformer decoder are linearized queries, we can trivially implement Transformer (T) by ignoring qualifier pairs during this linearization. The results are available in Table 2 and Table 3.
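The reduction to a triple-only dataset can be sketched in a few lines. Statements are modelled as `(s, r, o, qualifiers)` tuples, and the function name is an assumption of this sketch.

```python
def to_triples(statements):
    """Strip qualifiers and merge statements that share a main triple."""
    seen = set()
    triples = []
    for s, r, o, _quals in statements:
        triple = (s, r, o)
        if triple not in seen:  # keep the first occurrence, preserve order
            seen.add(triple)
            triples.append(triple)
    return triples


statements = [
    ("s", "r", "o", (("qr1", "qv1"), ("qr2", "qv2"))),
    ("s", "r", "o", (("qr3", "qv3"), ("qr4", "qv4"))),
    ("s", "r2", "o2", ()),
]
triples = to_triples(statements)
```

The two statements that differ only in their qualifiers collapse into a single triple, so qualifier-only entities and relations disappear from the vocabulary.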
We observe that triple-only baselines yield competitive results on JF17K and WikiPeople compared to hyper-relational models (see Table 2). As WikiPeople contains less than 3% of hyper-relational facts, the overall performance is dominated by the triple-only performance. We attribute the strong performance of the triple-only baseline on JF17K to the identified data leakage pertaining to this dataset. In other words, JF17K in its hyper-relational form exhibits issues similar to those identified by Akrami et al. (2020) in the FB15k and WN18 datasets proposed in (Bordes et al., 2013) for the triple-based LP task. We thus perform another experiment after cleaning JF17K from the assumed data leakage and report the results in Table 4 below. We observe a drastic performance drop of about 20 MRR points in both models, which provides experimental evidence of the flaws discussed in Sec. 5. We encourage future works in this domain to refrain from using these datasets in experiments.
In the case of WD50K (where about 14% of facts have qualifiers), STARE (H) + Transformer (H) yields about 16%, 23%, and 11% of relative improvement over the best performing triple-only baseline across MRR, H@1, and H@10, respectively (see Table 3). Akin to the previous experiment, we observe that increasing the ratio of hyper-relational facts in the dataset leads to even higher performance boosts. In particular, on WD50K (100), the H@1 of our hyper-relational model is higher than the H@10 of the triple baseline. This difference corresponds to 30 MRR and 32 H@1 points, which is about 85% and 123% relative improvement, respectively.
Based on the above observations, we conclude that information in hyper-relational facts indeed helps to better predict subjects and objects in the main triples of those facts.

Conclusion
We presented STARE, an instance of the message passing framework for representation learning over hyper-relational KGs. Experimental results suggest that STARE performs competitively on link prediction tasks compared with existing hyper-relational approaches and greatly outperforms triple-only baselines. In the future, we aim at applying STARE to node and graph classification tasks as well as extending our approach to large-scale KGs.
We also identified significant flaws in existing link prediction datasets and proposed WD50K, a novel, Wikidata-based hyper-relational dataset that is closer to real-world graphs and better captures the complexity of the link prediction task. In the future, we plan to enrich WD50K entities with class labels and probe it against node classification tasks.

A Further details on WD50K
In contrast with Freebase, which is no longer supported or updated, we choose Wikidata as the source KG for our dataset since it has an active community and has seen contributions from various companies that merge their knowledge with it. Additionally, many new NLP tasks (Xiong et al., 2020; Hayashi et al., 2019; Chakraborty et al., 2019), as well as datasets (Wang et al., 2019b; Mesquita et al., 2019; Dubey et al., 2019), are using Wikidata as a reference KG. The combined statistics of our dataset are presented in Table 1. WD50K consists of 47,156 entities and 532 relations, amongst which 5,460 entities and 45 relations are found only within qualifier (q_p, q_e) pairs. Fig. 5 illustrates how qualifiers are distributed among statements, i.e., 236,393 statements (99.9%) contain up to five qualifiers whereas the remaining 114 statements in a long tail contain up to 20 qualifiers. Fig. 6 illustrates the in-degree distribution.
Recall that we augmented our dataset to reduce test set leakage by removing all instances from the train and validation sets whose main triple (s, p, o) can be found in the test instances (Sec. 5). Another form of test leakage, as discovered in (Toutanova and Chen, 2015), may still persist in our dataset. To estimate this, we count the instances in the test set whose main triple's "direct" inverse (o, p, s), or "semantic" inverse (based on the relation P1696 in Wikidata, i.e., inverse of) is present in the train set. This amounts to less than 4% (1.6k out of 46k) instances in the test set.

Figure 7: Each fact has a unique integer index k which is shared between two COO matrices, i.e., the first one is for main triples, the second one is for qualifiers. Qualifiers that belong to the same fact share the index k.

Storing full adjacency matrices of large KGs is impractical due to O(|V|^2) memory consumption. GNNs encourage using sparse matrix representations, and adopting sparse matrices is shown (Cohen et al., 2020) to be scalable to graphs with millions of edges. As illustrated in Figure 7, we employ two sparse COO matrices to model hyper-relational KGs. The first COO matrix is of a standard format with rows containing indices of subjects, objects, and relations associated with the main triple of a hyper-relational fact.
In addition, we store an index k that uniquely identifies each fact. The second COO matrix contains rows of qualifier relations qr and entities qe that are connected to their main triple (and the overall hyper-relational fact) through the index k, i.e., if a fact has several qualifiers, those columns corresponding to the qualifiers of the fact will share the same index k. The overall memory consumption is therefore O(|E| + |Q|) and scales linearly with the total number of qualifiers |Q|. Given that most open-domain KGs rarely qualify each fact, e.g., as of August 2019, out of 734M Wikidata statements approximately 128M (17.4%) have at least one qualifier, this sparse qualifier representation conserves GPU memory, which is a limited resource.
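The two-matrix layout can be sketched with plain Python lists standing in for sparse tensors. The row order, the function name, and the toy vocabularies are assumptions of this sketch; only the idea of sharing the fact index k between both matrices is taken from the description above.

```python
def to_sparse_coo(statements, ent2id, rel2id):
    """Encode (s, r, o, quals) statements as two COO-style index matrices.

    triples:    rows = subject, object, relation, fact index k
    qualifiers: rows = qualifier relation, qualifier entity, fact index k
    Each column describes one main triple or one qualifier pair.
    """
    triples = [[], [], [], []]
    qualifiers = [[], [], []]
    for k, (s, r, o, quals) in enumerate(statements):
        triples[0].append(ent2id[s])
        triples[1].append(ent2id[o])
        triples[2].append(rel2id[r])
        triples[3].append(k)
        for qr, qv in quals:
            qualifiers[0].append(rel2id[qr])
            qualifiers[1].append(ent2id[qv])
            qualifiers[2].append(k)  # links the pair back to its fact
    return triples, qualifiers


# Toy vocabularies and statements for illustration.
ent2id = {"Albert Einstein": 0, "University of Zurich": 1, "Doctorate": 2, "Physics": 3}
rel2id = {"educated at": 0, "academic degree": 1, "academic major": 2, "employer": 3}
statements = [
    ("Albert Einstein", "educated at", "University of Zurich",
     (("academic degree", "Doctorate"), ("academic major", "Physics"))),
    ("Albert Einstein", "employer", "University of Zurich", ()),
]
triple_coo, qualifier_coo = to_sparse_coo(statements, ent2id, rel2id)
```

The first matrix has one column per fact and the second one column per qualifier pair, so storage grows with |E| + |Q| rather than with |V|^2.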

C Hyperparameters
We tuned the model (STARE encoder with Transformer decoder) on the validation set using the hyperparameters reported in Table 5. Implementations of the mult, ccorr, and rotate functions in φ_q and φ_r correspond to DistMult (Yang et al., 2015), circular correlation (Nickel et al., 2016), and RotatE (Sun et al., 2019), respectively. The selected hyperparameters include two STARE layers, an embedding dimension of 200, a batch size of 128, the Adam optimizer with a 0.0001 learning rate, and 0.1 label smoothing. φ_r and φ_q are rotate functions, γ(·) is a weighted sum function with α of 0.8, qualifiers are aggregated using a simple summation, and the dropout rate is 0.3. We use a 2-layer Transformer block with a hidden dimension of 512 and 4 attention heads with a 0.1 dropout rate as our decoder. For the WD50K and JF17K datasets we set the maximum length of a hyper-relational fact to 15 (i.e., a statement can contain at most 6 qualifier pairs), and to 7 for WikiPeople.
Infrastructure and Parameters. We train all models on one Tesla V100 GPU. Due to a large number of parameters, owing to large trainable embedding matrices, it is advisable to use a GPU with at least 12GB of VRAM. Running STARE (H) + Transformer (H) models with the selected hyperparameters on WD50K requires approximately 2 days to train and has 10.8M parameters; on JF17K the model has 7.1M parameters and takes about 10 hours to train; on WikiPeople the model has 8.2M parameters, which we run for 500 epochs and which takes about 4 days.
StarE (H) + Transformer (H) models on reduced datasets: the model corresponding to WD50K (33) has 9M parameters and takes 20 hours to train, while the WD50K (66) model has 6.8M parameters and takes about 9 hours to train. In the case of WD50K (100), the model has 5M parameters and takes 5 hours to train.

D Decoders
As an additional experiment, we pair STARE with different decoders and evaluate them over the WD50K datasets. Along with the main reported model denoted as StarE + Trf, we implemented two CNN-based decoders and another Transformer-based decoder. All models are trained with the same encoder hyperparameters as chosen in the main reported model.
StarE + ConvE relies on the ConvE (Dettmers et al., 2018)-like decoder but expanded for statements with qualifiers. Given a query (s, r, {(qr i , qv i ), ... }), we stack entities and relations embeddings row-wise and reshape the tensor into an image of size H × W . For instance, for a statement with 6 qualifier pairs, i.e., query length of 14, and an embedding size of 200, we obtain images of size 40 × 70. We then apply a 2D convolutional layer with a 7 × 7 kernel for each image, apply ReLU, flatten the resulting tensor, and pass it through a fully-connected layer. We used 200 filters and the learning rate was set to 0.001.
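The shape arithmetic in this decoder can be checked with a short sketch: 14 stacked embeddings of size 200 hold 2,800 values, which is exactly 40 × 70. The helper names are assumptions, and the convolution output size below assumes stride 1 and no padding, neither of which is stated in the text.

```python
def reshape_to_image(stacked, height, width):
    """Flatten row-wise stacked embeddings and reshape them to height x width."""
    flat = [x for row in stacked for x in row]
    if len(flat) != height * width:
        raise ValueError("height * width must equal the number of values")
    return [flat[i * width:(i + 1) * width] for i in range(height)]


def conv_output_size(height, width, kernel=7, stride=1, padding=0):
    """Spatial size after a square 2D convolution (stride and padding are assumptions)."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w


query_len, emb_dim = 14, 200
stacked = [[0.0] * emb_dim for _ in range(query_len)]
image = reshape_to_image(stacked, 40, 70)
feature_map = conv_output_size(40, 70, kernel=7)
```

Under these assumptions each of the 200 filters produces a 34 × 64 feature map before flattening.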
StarE + ConvKB is based on the ConvKB (Nguyen et al., 2018)-like decoder adjusted for statements with qualifiers. Given a query (s, r, {(qr_i, qv_i), ...}), we stack entity and relation embeddings row-wise and apply a 2D convolutional layer with an L_Q × 7 kernel, e.g., for queries of length 14 the kernel size is 14 × 7. We then apply ReLU, flatten the resulting tensor, and
We then pass it through the Transformer layers and retrieve the representation of the [MASK] token. Finally, the token representation is passed through a fully-connected layer. We trained the model with 0.0001 as the learning rate.
Table 6 reports link prediction results on a variety of WD50K datasets with different decoders. The default StarE + Trf decoder generally attains superior results, with the biggest gains along the H@1 metric.

E Relation-Qualifiers Aggregation
In this experiment, we measure the impact of the choice of the γ(·) function which is used for aggregating representations of a relation and its qualifiers (see Eq. 5). To evaluate its impact we use STARE (H) + Transformer (H) models on four WD50K datasets using three functions, i.e., concatenation [h_r, h_q], element-wise multiplication h_r ⊙ h_q, and weighted sum α ⊙ h_r + (1 − α) ⊙ h_q where α is fixed to 0.8.
The results are presented in Fig. 8. We find that all three settings have similar performance, indicating the model's stability with respect to the choice of the γ(·) function.