LibKGE - A knowledge graph embedding library for reproducible research

LibKGE ( https://github.com/uma-pi1/kge ) is an open-source PyTorch-based library for training, hyperparameter optimization, and evaluation of knowledge graph embedding models for link prediction. The key goals of LibKGE are to enable reproducible research, to provide a framework for comprehensive experimental studies, and to facilitate analysis of the contributions of individual components of training methods, model architectures, and evaluation methods. LibKGE is highly configurable, and every experiment can be fully reproduced with a single configuration file. Individual components are decoupled to the extent possible so that they can be mixed and matched with each other. Implementations in LibKGE aim to be as efficient as possible without leaving the scope of Python/NumPy/PyTorch. A comprehensive logging mechanism and tooling facilitate in-depth analysis. LibKGE provides implementations of common knowledge graph embedding models and training methods, and new ones can be easily added. A comparative study (Ruffinelli et al., 2020) showed that, with a modest amount of automatic hyperparameter tuning, LibKGE achieves performance competitive with the state of the art for many models.


Introduction
Knowledge graphs (KGs) (Hayes-Roth, 1983) encode real-world facts as structured data. A knowledge graph can be represented as a set of (subject, relation, object) triples, where the subject and object entities correspond to vertices, and relations to labeled edges in a graph.
KG embedding (KGE) models represent the KG's entities and relations as dense vectors, termed embeddings. KGE models compute a score based on these embeddings and are trained with the objective of predicting high scores for true triples and low scores for false triples. Link prediction is the task of predicting edges missing in the KG (Nickel et al., 2015). Some uses of KGE models are: enhancing the knowledge representation in language models (Peters et al., 2019), drug discovery in biomedical KGs (Mohamed et al., 2019), as part of recommender systems (Wang et al., 2017), or visual relationship detection (Baier et al., 2017).
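To make the scoring idea concrete, the following is an illustrative sketch (not LibKGE code) of a DistMult-style scoring function: each entity and relation gets a dense vector, and the score of a triple is a simple product-sum of the three vectors. The toy embeddings are made up for illustration.

```python
# Illustrative sketch (not LibKGE's implementation): a DistMult-style
# scoring function. Training would push true triples toward high scores
# and false triples toward low scores.

def distmult_score(e_s, r, e_o):
    """Score a (subject, relation, object) triple from its embeddings."""
    return sum(s * p * o for s, p, o in zip(e_s, r, e_o))

# Hypothetical toy embeddings (2-dimensional for readability).
emb = {"berlin": [0.9, 0.1], "germany": [0.8, 0.2]}
rel = {"capital_of": [1.0, 0.5]}

score = distmult_score(emb["berlin"], rel["capital_of"], emb["germany"])
```

In a real model the dimensions are in the hundreds and the vectors are learned; the sketch only shows the shape of the computation.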
KGE models for link prediction have seen heightened interest in recent years. Many components of the KGE pipeline (i.e., KGE models, training methods, evaluation techniques, and hyperparameter optimization) have been studied in the literature, as has the pipeline as a whole (Nickel et al., 2016; Wang et al., 2017; Ali et al., 2020). Ruffinelli et al. (2020) argued that it is difficult to draw conclusions about the impact of each component from the original publications. For example, multiple components may have been changed simultaneously without an ablation study, baselines may not have been trained with state-of-the-art methods, or the hyperparameter space may not have been sufficiently explored.
LIBKGE is an open-source KGE library for reproducible research. It aims to facilitate meaningful experimental comparisons of all components of the KGE pipeline. To this end, LIBKGE is faithful to the following principles:

Modularization and extensibility. LIBKGE is cleanly modularized. Individual components can be mixed and matched with each other, and new components can be easily added.

Configurability and reproducibility. In LIBKGE, an experiment is entirely defined by a single configuration file with well-documented configuration options for every component. When an experiment is started, its current configuration is stored alongside the model to enable reproducibility and analysis.
Profiling and analysis. LIBKGE performs extensive logging during experiments and monitors performance metrics such as runtime, memory usage, training loss, and evaluation metrics. Additionally, specific monitoring of any part of the KGE pipeline can be added via a hook system. The logging is done in both human-readable form and in a machine-readable format.
Ease of use. LIBKGE is designed to support the workflow of researchers with convenient tooling and single-line commands. Each training job or hyperparameter search job can be interrupted and resumed at any time. For hyperparameter tuning, LIBKGE supports grid search, quasi-random search, and Bayesian optimization. All implementations stay in the realm of Python/PyTorch/NumPy and aim to be as efficient as possible.
LIBKGE supports the needs of researchers who want to investigate new components or improvements of the KGE pipeline. These strengths enabled a comprehensive study that provided new insights about training KGE models (Ruffinelli et al., 2020). For an overview of usage, pretrained models, and detailed documentation, please refer to LIBKGE's project page. In this paper, we discuss the key principles of LIBKGE.

Modularization and extensibility
LIBKGE is highly modularized, which allows users to mix and match training methods, models, and evaluation methods (see Figure 1). The modularization provides simple and clean ways to extend the framework with new features that become available for every model.
For example, LIBKGE decouples the RelationalScorer (the KGE scoring function) and the KgeEmbedder (the way embeddings are obtained), as depicted in Figure 1. In other frameworks, the embedder function is hardcoded to the equivalent of LIBKGE's LookupEmbedder, in which embeddings are explicitly stored for each entity. Due to LIBKGE's decoupling, the embedder type can be freely specified independently of the scoring function, which enables users to train a KGE model with other types of embedders. For example, the embedding function could be an encoder that computes an entity or relation embedding from textual descriptions or the pixels of an image (Pezeshkpour et al., 2018; Broscheit et al., 2020, inter alia).
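The decoupling described above can be sketched as follows. This is a simplified, hypothetical rendering of the idea (the class and method names only loosely mirror LibKGE's actual API): a scorer consumes embeddings without caring how they were produced, so a lookup table and a token-pooling encoder are interchangeable.

```python
# Hypothetical sketch of the scorer/embedder decoupling (simplified;
# not LibKGE's actual class hierarchy).

class LookupEmbedder:
    """Stores one explicit vector per entity id (the common default)."""
    def __init__(self, table):
        self.table = table
    def embed(self, idx):
        return self.table[idx]

class TokenPoolEmbedder:
    """Computes an entity embedding by averaging its name-token vectors."""
    def __init__(self, token_vecs, entity_tokens):
        self.token_vecs = token_vecs
        self.entity_tokens = entity_tokens
    def embed(self, idx):
        toks = self.entity_tokens[idx]
        dim = len(next(iter(self.token_vecs.values())))
        pooled = [0.0] * dim
        for t in toks:
            for i, v in enumerate(self.token_vecs[t]):
                pooled[i] += v / len(toks)
        return pooled

def score(embedder, s, r_vec, o):
    # DistMult-style score; works with ANY embedder implementation,
    # which is the point of the decoupling.
    e_s, e_o = embedder.embed(s), embedder.embed(o)
    return sum(a * b * c for a, b, c in zip(e_s, r_vec, e_o))

lookup = LookupEmbedder({0: [1.0, 0.0], 1: [0.0, 1.0]})
pooled = TokenPoolEmbedder({"new": [1.0, 0.0], "york": [0.0, 1.0]},
                           {0: ["new", "york"], 1: ["york"]})
```

Because `score` only calls `embed`, swapping `lookup` for `pooled` requires no change to the scoring code, mirroring how LibKGE lets the embedder type be specified independently of the scoring function.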

Configurability and reproducibility
Reproducibility hinges on configuration: to enable reproducibility, the entire configuration of each experiment must be persistently stored and accessible. While this sounds almost obvious, the crux is how it can be achieved. Source code can and will change. Therefore, to make an experiment reproducible in a given setting, its configuration has to be decoupled from the code as much as possible.
In LIBKGE, all settings are always retrieved from a configuration object that is initialized from configuration files and used by all components of the pipeline. This leads to comprehensive configuration files that fully document an experiment and thereby make it reproducible.
To make this comprehensive configurability feasible, while also remaining modular, LIBKGE includes a lightweight import functionality for configuration files. In Figure 2, we show an (almost) minimal configuration for training a ComplEx KGE model (Trouillon et al., 2016). The main configuration file my_experiment.yaml in Figure 2 automatically imports the model-specific configuration complex.yaml, which in turn imports lookup_embedder.yaml. The latter defines the default configuration of the LookupEmbedder for entities and relations, which associates every entity and relation identifier with its respective embedding. All configurations are merged into a single configuration object. During merging, the settings in the main configuration file always take precedence over the settings from imported files. The resulting single configuration is automatically saved in the experiment directory along with the checkpoints and the log files.
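The merge semantics just described can be sketched in a few lines. This is an assumed, simplified rendering (the key names below are illustrative stand-ins, not a complete LibKGE configuration): imported defaults are merged recursively, and keys from the main configuration file take precedence.

```python
# Sketch of recursive config merging with main-file precedence
# (simplified; LibKGE's actual Config class does more, e.g. key
# validation against the documented options).

def merge(defaults, overrides):
    """Return defaults deep-updated with overrides (overrides win)."""
    out = dict(defaults)
    for key, val in overrides.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], val)   # recurse into nested blocks
        else:
            out[key] = val                    # main file takes precedence
    return out

# Imported defaults (as if from lookup_embedder.yaml) ...
imported_defaults = {"lookup_embedder": {"dim": 100, "dropout": 0.0}}
# ... and the user's main configuration file (as if my_experiment.yaml).
my_experiment = {"model": "complex", "lookup_embedder": {"dim": 256}}

config = merge(imported_defaults, my_experiment)
```

The user's `dim: 256` overrides the imported default, while untouched defaults such as `dropout` survive the merge, so the final object fully documents the experiment.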
As an example of how configurability also aids modularization, we return to the example of replacing the LookupEmbedder with an encoder that computes entity embeddings from string tokens. For this purpose, one may implement a TokenPoolEmbedder. The simple configuration changes needed to use the new embedder type are shown in Figure 3. Besides reflecting currently known best practices, LIBKGE also includes, and makes configurable, some settings that might not be considered best practice, e.g., different tie-breaking schemes for ranking evaluations (Sun et al., 2020). Therefore, with regard to configurability, the goal is not only that the framework reflects best practices, but also popular practices that might influence ongoing research.

Hyperparameter optimization
Hyperparameter optimization is crucial for empirically investigating the impact of individual components of the KGE pipeline. LIBKGE offers manual search, grid search, random search, and Bayesian optimization; the latter two are provided by the hyperparameter optimization framework Ax (https://ax.dev/). In this context, LIBKGE further benefits from its configurability because everything can be treated as a hyperparameter, even the choice of model, scoring function, or embedder. The example in Figure 4 shows a simple hyperparameter search with an initial quasi-random phase and a subsequent Bayesian optimization phase over the learning rate, batch size, and number of negative samples for the ComplEx model. The trials during the quasi-random search are independent, which can be exploited by parallelizing their runs over multiple devices; in Figure 4, setting the keys search.device_pool and search.num_workers in lines 3 and 4 distributes 4 parallel trials over two GPU devices. In this way, a comprehensive search over a large space of hyperparameters can be sped up significantly (for more details, please refer to the documentation).
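The quasi-random phase of such a search can be sketched as follows. This is a hypothetical toy, not LibKGE's implementation: the hyperparameter names loosely mirror LibKGE-style configuration keys, the objective is a stand-in for "train, then evaluate on validation data", and the point is that independent trials are trivially parallelizable.

```python
import random

# Hypothetical sketch of a quasi-random search phase (not LibKGE code).
# Each entry maps a hyperparameter key to a sampling function.
space = {
    "train.lr": lambda rng: 10 ** rng.uniform(-4, -1),
    "train.batch_size": lambda rng: rng.choice([128, 256, 512]),
    "negative_sampling.num_samples": lambda rng: rng.choice([16, 64, 256]),
}

def sample_trial(rng):
    return {key: draw(rng) for key, draw in space.items()}

def run_search(num_trials, objective, seed=0):
    rng = random.Random(seed)
    trials = [sample_trial(rng) for _ in range(num_trials)]
    # Trials are independent, so this loop could be distributed over
    # multiple workers/devices, as LibKGE does for its search jobs.
    results = [(objective(t), t) for t in trials]
    return max(results, key=lambda r: r[0])  # best validation score

# Toy objective standing in for a full train-and-validate cycle.
best_score, best_cfg = run_search(8, lambda t: -abs(t["train.lr"] - 0.01))
```

A Bayesian optimization phase would then use the collected (trial, score) pairs to propose new trials instead of sampling them independently.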

Profiling and metadata analysis
LIBKGE provides extensive options for profiling, debugging, and analyzing the KGE pipeline. While most frameworks print the current training loss, and some also record validation metrics, LIBKGE aims to make every moving part of the pipeline observable. By default, LIBKGE records during training quantities such as runtimes, training loss and penalties (e.g., from the entity and relation embedders), relevant metadata such as the PyTorch version and the current commit hash, and dependencies between various jobs. We show an example logging output for one training epoch in Appendix B. For more fine-grained logging, LIBKGE can also log at the batch level.
During evaluation, the framework records many variations of the evaluation metrics, such as metrics grouped by relation type, relation frequency, or head and tail prediction. Additionally, users can extract and add information by attaching custom functions to one of multiple hooks that are executed before and after all relevant calls in the framework. In this way, users can interact with all components of the pipeline without risking divergence from LIBKGE's master branch.
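The hook mechanism can be sketched as below. The names are hypothetical (LibKGE's actual hook API may differ in detail); the point is that user callbacks run before and after relevant calls, so custom instrumentation needs no changes to framework code.

```python
# Minimal sketch of a hook mechanism (hypothetical names, not
# LibKGE's exact API): callbacks registered on a job run around
# the framework's own work.

class Job:
    def __init__(self):
        self.pre_epoch_hooks = []
        self.post_epoch_hooks = []
        self.trace = []                  # machine-readable log entries

    def run_epoch(self, epoch):
        for hook in self.pre_epoch_hooks:
            hook(self, epoch)
        loss = 1.0 / epoch               # stand-in for real training work
        self.trace.append({"epoch": epoch, "avg_loss": loss})
        for hook in self.post_epoch_hooks:
            hook(self, epoch)

job = Job()
seen = []
# User-supplied hook: inspect the latest trace entry after each epoch.
job.post_epoch_hooks.append(lambda j, e: seen.append(j.trace[-1]["avg_loss"]))
for epoch in (1, 2):
    job.run_epoch(epoch)
```

Because the hook only reads the job object, it can be dropped in or removed without touching the training loop itself.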
Finally, LIBKGE provides convenience methods to export (subsets of) the logged metadata into plain CSV files.
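In the spirit of that export tooling, the following sketch flattens machine-readable trace entries (here, made-up dicts) into CSV using only the standard library; LibKGE's own export commands are richer than this.

```python
import csv
import io

# Illustrative trace entries, as a training job might log them
# (the keys and values here are made up for the example).
trace = [
    {"epoch": 1, "avg_loss": 0.52, "epoch_time": 12.3},
    {"epoch": 2, "avg_loss": 0.41, "epoch_time": 12.1},
]

def trace_to_csv(entries, keys):
    """Export the selected keys of each trace entry as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=keys)
    writer.writeheader()
    for entry in entries:
        writer.writerow({k: entry.get(k) for k in keys})
    return buf.getvalue()

csv_text = trace_to_csv(trace, ["epoch", "avg_loss"])
```

Selecting a subset of keys mirrors exporting "(subsets of)" the logged metadata for downstream analysis, e.g., in pandas or a spreadsheet.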

Comparison to other KGE Projects
In this section, we compare LIBKGE to other open-source software (OSS) that provides functionality for training and evaluating KGE models for link prediction. The assessments are a snapshot taken at the end of May 2020. All model-specific comparisons were evaluated w.r.t. the ComplEx model, which is supported by all projects. Logging denotes the number of metadata keys that are logged per epoch for training and evaluation in a machine-readable format for later analysis. Hyperparameter optimization shows whether the project supports grid search, random search, and Bayesian optimization. Resume denotes the ability to resume a hyperparameter search or training run from a checkpoint at any time. Active is the number of commits to the master branch in the last 12 months.
In Table 1, we provide an overview of other KGE projects (full references in Appendix C) and compare them w.r.t. configurability and ease of use. We mainly included projects that could serve as a basis for a researcher's experiments because they are active, functional, and cover at least a few of the most common models. All projects can be extended with models, losses, or training methods. Large-scale projects and paper-code projects typically have a narrower scope than the more holistic frameworks; e.g., they often do not feature hyperparameter optimization. Large-scale projects are typically tailored towards parallelizing training methods and models.
The focus on configurability and reproducibility in LIBKGE is reflected by the large number of configurable keys. For example, in contrast to other projects, LIBKGE does not tie the regularization weights of the entity and relation embedder to be the same. For entity ranking evaluation, only LIBKGE and PyKeen transparently implement different tie-breaking schemes for equally ranked entities. This is important because evaluation under different tie-breaking schemes can result in differences of ≈0.40 MRR for some models and can lead to misleading conclusions, as shown by Sun et al. (2020). OpenKE, for example, only supports the problematic tie-breaking scheme named TOP by Sun et al. (2020). LIBKGE and PyKeen are the only frameworks that provide machine-readable logging. Only LIBKGE offers resuming from a checkpoint for training and hyperparameter search. LIBKGE, Ampligraph, and PyKeen are the most active projects in terms of the number of commits during the past 12 months.
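Why tie breaking matters can be seen in a few lines. The sketch below (illustrative, not LibKGE's evaluation code) computes the rank of the true entity under three common schemes; under the optimistic scheme (TOP in Sun et al., 2020), a degenerate model that scores every candidate identically still achieves a perfect rank.

```python
# Illustrative sketch of tie-breaking schemes in entity ranking
# (not LibKGE's actual evaluation code).

def ranks(true_score, other_scores):
    """Rank of the true entity among other candidates, under
    optimistic (TOP), realistic (expected), and pessimistic ties."""
    greater = sum(s > true_score for s in other_scores)
    equal = sum(s == true_score for s in other_scores)
    optimistic = greater + 1               # true entity wins all ties
    realistic = greater + equal / 2 + 1    # expected rank, random ties
    pessimistic = greater + equal + 1      # true entity loses all ties
    return optimistic, realistic, pessimistic

# Degenerate model: every one of 100 candidates gets the same score.
opt, real, pess = ranks(0.0, [0.0] * 99)
```

Here the optimistic rank is 1 (MRR 1.0) while the pessimistic rank is 100, which is exactly the kind of gap that makes results under different schemes incomparable.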
Efficiency. In Table 2, we show a comparison of KGE frameworks in terms of the time for one full training epoch. The configuration was chosen such that it is supported by all frameworks and demonstrates behaviour under varying load. We translated the configuration faithfully to each framework, ensuring that the total number of embedding parameters per batch is the same for each framework. Most projects, including LIBKGE, handle small numbers of negative samples efficiently, but LIBKGE seems to scale more gracefully as the number of negative samples grows.

Predictive performance. In Table 3, we collect the reported performances for ComplEx on the dataset FB15K-237 (Toutanova and Chen, 2015). The numbers are not directly comparable due to the different amounts of effort spent on finding a good configuration, but they reflect the performance that the framework authors achieved in their experiments.
The results show that a state-of-the-art result can be achieved with LIBKGE's architecture and hyperparameter optimization. For more results obtained with LIBKGE and an in-depth analysis of the impact of hyperparameters on model performance, we refer to the study by Ruffinelli et al. (2020).

Conclusions
In this work, we presented LIBKGE, a configurable, modular, and efficient framework for reproducible research on knowledge graph embedding models. We briefly described the internal structure of the framework and how it facilitates LIBKGE's goals. The framework is efficient and yields state-of-the-art performance. We hope that LIBKGE is a helpful ingredient for gaining new insights into knowledge graph embeddings, and that a lively community gathers around this project to improve and extend it further.