DBee: A Database for Creating and Managing Knowledge Graphs and Embeddings

This paper describes DBee, a database to support the construction of data-intensive AI applications. DBee provides a unique data model which operates jointly over large-scale knowledge graphs (KGs) and embedding vector spaces (VSs). This model supports queries which exploit the semantic properties of both types of representations (KGs and VSs). Additionally, DBee aims to facilitate the construction of KGs and VSs, by providing a library of generators, which can be used to create, integrate and transform data into KGs and VSs.


Introduction
Many AI tasks can be summarised as a cycle of collecting data, overlaying a representation (schema) on top of the data and running learning and inference algorithms, which eventually produce new data or extend the representation. While learning and inference are often put at centre stage, managing the data and the supporting representations is a fundamental part of the design and delivery of an AI system.
Currently, the prevalence of workflow architectures for many types of AI systems reflects the emphasis on learning and inference, where data management becomes a secondary concern. However, complex AI tasks such as Question Answering (QA) (Kumar et al., 2016), Text Entailment (Hashimoto et al., 2016) or Natural Language Inference are either directly dependent on or can benefit from the construction of supporting Knowledge Bases.
Recently, latent and explicit semantic representations have been emerging as fundamental elements for supporting those tasks, due to their dependency on commonsense and domain-specific knowledge. Moreover, the recent rise of successful approaches operating at the neuro-symbolic representation level (Parisotto et al., 2016) demands a closer dialogue between explicit and latent models. Word embeddings (Mikolov et al., 2013) and Knowledge Graphs (lexico-semantic graphs) (Bollacker et al., 2008) are becoming the de facto representation models within different AI tasks. They also have complementary properties: word embeddings provide coarse-grained semantics, which are complemented by the fine-grained semantics of KGs, and the two are commonly used in coordination (Silva et al., 2018; Xie et al., 2017).
This paper describes DBee, a database for creating, querying and consuming embeddings and knowledge graphs. DBee aims to be a database designed for satisfying recurring demands from AI applications. DBee provides a seamless layer to jointly query knowledge graphs and embeddings, simultaneously exploiting the semantic properties of both resources, taking into account performance and scalability aspects. At the centre of the proposed database is the goal of bridging the gap between data, representation, learning and inference algorithms, where classifiers and extractors directly interface with the schema. By design, DBee provides a declarative layer for data and representation management in AI systems. Finally, DBee also supports the combination of different models and representations (cross-model and cross-representation queries) and their customisation.
In the following sections of the paper we motivate our approach with an initial scenario, discuss background and related work, describe the proposed framework, present the implemented system by instantiating it for archetypal use cases and conclude with a discussion outlining the expected performance, hardware requirements and current limitations of the system.

Motivational Scenario
An AI application engineer wants to build a QA system to support investors in NASDAQ companies. Most of the data relevant for this task, such as financial reports, blog articles and recent news, only exists in the form of unstructured text. The engineer also anticipates the benefits of integrating structured Knowledge Graphs such as DBpedia (Auer et al., 2007) with the textual data sources. Realizing the importance of generating traceable and explainable answers, he decides to use an explicit internal representation, such as the graph-based RDF-NL. With the associated chain of classifiers and extractors available in DBee, he performs Open Information Extraction (OIE), Coreference Resolution (CR), Entity Linking (EL) and Rhetorical Structure Classification (RSC) to obtain the graph from a chosen set of documents. After the extraction, the graph is indexed to ensure efficiency for different types of queries over the graph representation. In order to support semantic approximation during the queries, he associates two pre-trained word embedding models with the KG using the DBee API and uses the provided set of query primitives to query the knowledge graph. Deciding to use triples from the KG as features for a neural stock predictor model, he uses the DBee API to create input and answer sets (for a set of pre-defined queries) ready to be consumed by the automatic differentiation framework of his choice.

Background & Related Work
Current machine learning systems such as Keras (Chollet et al., 2015) and PyTorch (Paszke et al., 2017) focus mostly on exposing the definition of neural architectures to their users, abstracting away the computational details of automatic differentiation (or even trying to learn those; Jin et al., 2018). TensorFlow (Abadi et al., 2016) is the most complete suite, providing assistance from data streaming through training to model serving. Our approach can be seen as complementary to these efforts, since we aim to provide the infrastructure to extract, represent and query structured and unstructured data (with an emphasis on KGs from text and associated embeddings).
Early efforts in a similar direction include Sales et al. (2018), who present a uniform service-based API for storing, querying and comparing word embeddings, pre-computed with varying models and on different datasets. Another information management tool for unstructured data is Apache UIMA.
Contemporary machine comprehension systems based on neural architectures have targeted evaluation settings which have limited document scale (e.g. SQUAD (Rajpurkar et al., 2016)).
Different works have explored the connection between distributional semantics and structured Knowledge Graph representations in the context of semantic parsing over large-scale RDF graphs (Freitas, 2015; Sales et al., 2016) and approximative abductive reasoning over commonsense KBs (Freitas et al., 2013). Comparatively, DBee focuses on explicit semantic representation models (Knowledge Graphs) extracted from text.

Proposed Framework
To satisfy the emerging need to work with unstructured text representations, as depicted in the introductory part of this work, we propose a framework that supports the extraction and management of both explicit and latent text representation models and facilitates the integration with downstream machine-learning-based models. DBee was designed to deliver the following features:
1. Bridging the gap between unstructured data and semantic representations: Conforming data into latent and explicit text representations is a primary requirement for many AI applications. DBee allows users to create, reuse and compose a library of text extractors and classifiers which will be used to structure and integrate existing unstructured data. The library includes standard representation generators such as syntactic and lexical parsers, open information extractors, named entity recognisers and linkers and discourse-level extractors.
2. Multi-representation model: DBee supports users in experimenting with different types of explicit and latent semantic representations and models. Different tasks will require different types of representation. Users should be able to query across multiple representations.
3. Expressive structured queries and ML integration: To give its users fine-grained control over the data and to overlay their own machine learning algorithms, DBee features an intuitive query language and seamless integration with existing machine learning algorithms.
4. Extensibility: Representation schemas and their supporting generators are extensible and customisable.
5. Scalability: Operating over large-scale data sources, large knowledge graphs and embeddings requires principled query processing strategies. DBee inherits indexing strategies from databases and kNN embedding queries in order to support scaling to large datasets, memory footprints and storage space requirements.
DBee operates over two types of representation: knowledge graphs and word embeddings. The underlying knowledge graph data model uses RDF-NL, an extension of the RDF (Lassila and Swick, 1999) data model suitable for representing text as a lexico-semantic Knowledge Graph. RDF-NL is built upon a sentence representation model proposed by Niklaus et al. (2017, 2019), which splits complex sentences into simpler linked clausal and phrasal elements, later splitting these elements into predicate-argument structures. The graph data model (Figure 2 (a)) is defined by a subject-predicate-object (SPO) triple which can have contextual relations (C) as reifications or can be linked to other SPO triples. Contextual links can be named. This data model supports the creation of versatile sparse graph representations. For example, the data model smoothly captures linguistic predicate-argument structures and phrasal (e.g. appositive), clausal (coordination and subordination), rhetorical and argumentation relations. Figure 3 shows an example of a concrete knowledge graph extracted from a sentence.
All SPO nodes are defined by their lexical realisation (typically a text chunk) and can be linked to a canonical identifier in the entity component of the data model (Figure 2 (b)), which allows an entity-centric data integration, such as it is performed by co-reference resolution, entity linking or word-vector clustering.
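The data model described above can be sketched with a few Python dataclasses. This is only an illustrative sketch and not DBee's actual schema: the class names (Entity, Node, Triple, ContextLink), the example sentence and the relation labels are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass(frozen=True)
class Entity:
    """Canonical identifier in the entity component (e.g. a DBpedia URI)."""
    uri: str


@dataclass
class Node:
    """A subject, predicate or object, defined by its lexical realisation."""
    text: str
    entity: Optional[Entity] = None  # filled in by entity linking / coreference


@dataclass
class Triple:
    """A subject-predicate-object (SPO) structure with contextual links."""
    subject: Node
    predicate: Node
    obj: Node
    contexts: List["ContextLink"] = field(default_factory=list)


@dataclass
class ContextLink:
    """A named contextual relation (C) to a text chunk or to another triple."""
    name: str
    target: Union[Node, Triple]


# Hypothetical sentence:
# "Apple opened an office in Shanghai after it reported record sales."
apple = Entity("http://dbpedia.org/resource/Apple_Inc.")
main = Triple(Node("Apple", apple), Node("opened"), Node("an office"))
main.contexts.append(ContextLink("SPATIAL", Node("in Shanghai")))

# "it" and "Apple" share one canonical entity (entity-centric integration).
sub = Triple(Node("it", apple), Node("reported"), Node("record sales"))
main.contexts.append(ContextLink("TEMPORAL_AFTER", sub))
```

The two surface forms "Apple" and "it" stay distinct as lexical realisations but point to the same canonical identifier, which is the kind of entity-centric integration the paragraph above describes.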
The data model is materialised into different types of supporting indexes in order to enable efficient and scalable query processing. There are two main types of indexes associated with the data model:
• Embedding Indexing (EI): Supports the efficient querying of embedding spaces (k-NN similarity queries). By default it uses the random projections of the locality-sensitive hashing method proposed by Charikar (2002).
• Knowledge Graph Lexical Indexing (TI): Supports information-retrieval-style keyword search queries over the KG structure using inverted indexes and associated weighting schemes (by default, TF-IDF is used).
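To make the embedding index more concrete, the following is a minimal sketch of random-hyperplane hashing in the spirit of Charikar (2002), written in plain Python. It is not DBee's index implementation; the function names, the 16-bit signature length and the toy vectors are assumptions chosen for illustration. Each bit records on which side of a random hyperplane a vector falls, and two vectors at angle θ disagree on any single bit with probability θ/π, so the Hamming distance between signatures approximates angular distance.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian normal vectors, one per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: the side of the plane the vector lies on."""
    return tuple(
        1 if sum(v * h for v, h in zip(vector, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=3, n_bits=16)
a = [1.0, 0.9, 0.1]
b = [0.9, 1.0, 0.0]     # small angle to a
c = [-1.0, -0.8, 0.2]   # roughly opposite to a

sig_a, sig_b, sig_c = (signature(v, planes) for v in (a, b, c))
# Similar vectors are expected (not guaranteed) to have closer signatures.
print(hamming(sig_a, sig_b), hamming(sig_a, sig_c))
```

Signatures can then be used as bucket keys, so a k-NN query only compares the query vector against candidates whose signatures are close, instead of scanning the whole space.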

Operations
At the centre of the DBee data model is the ability to build, transform and combine KGs and Vector Spaces/Embeddings (VSs). Different KGs and VSs can be combined using a view, which supports querying specific compositions. Projections (π) are operators which build VSs (embeddings) from KGs and unstructured datasets.
On top of the views, query operators are defined. Views, projections and queries can be chained together. If called in the middle of a function chain, these functions serve the purpose of a join in a sense similar to relational databases. This means views and projections later in the operation chain will only operate on the subset of results satisfying the query defined in the chain so far. This behaviour is visualized in Figure 5.
The domain-specific query language (DSL) associated with DBee includes the following functions:
• query(term, n): This operation retrieves up to n best-matching candidates from the view/projection it is executed upon, with respect to its type. For a projection, for example, it retrieves the n nearest neighbours regarding their embeddings, after embedding the query term using the projection space's corresponding embedding function.
• filter(attribute=value | bgp | conditional statement): Filters are selection operators (σ) for predicates defined as the function's parameters. They can be given in attribute=value form or with the help of basic graph patterns.
• rank(wrt): This function can be used to rank a set of results with respect to a given term by their distance to it in the corresponding vector space.
• top(n), count(): Aggregation operators which retrieve the top results up to a given limit or count them, respectively. Provided a name, the result set will consist of attributes of this name.
• create_view(name): Creates a new view with a given name from the current result set.
• create_projection(name, using, features): Similarly, creates a new projection using a given embedding function by extracting the features from every result in the set. Features may be defined simply by providing a list of attribute names to use or any callable operation to extract custom features.
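The chaining behaviour of these operators can be illustrated with a toy in-memory stand-in. This is a sketch under stated assumptions and not DBee's implementation: the ResultSet class, the extra attribute parameter of rank, the two-dimensional toy vectors and the company names are all hypothetical. It only shows how each call narrows or reorders the result set that the next call operates on.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


class ResultSet:
    """Toy view: a list of dict records with chainable operators."""

    def __init__(self, rows, embed):
        self.rows = list(rows)
        self.embed = embed  # embedding function: str -> list of floats

    def filter(self, **conditions):
        """Selection: keep rows whose attributes equal the given values."""
        kept = [r for r in self.rows
                if all(r.get(k) == v for k, v in conditions.items())]
        return ResultSet(kept, self.embed)

    def rank(self, wrt, attribute):
        """Order rows by distance between row[attribute] and the term wrt."""
        target = self.embed(wrt)
        ordered = sorted(
            self.rows,
            key=lambda r: cosine_distance(self.embed(r[attribute]), target),
        )
        return ResultSet(ordered, self.embed)

    def top(self, n):
        return ResultSet(self.rows[:n], self.embed)

    def count(self):
        return len(self.rows)


# Hypothetical 2-d "embeddings" and facts, for illustration only.
TOY_VECTORS = {"China": [1.0, 0.0], "Shanghai": [0.9, 0.1],
               "Berlin": [0.1, 0.9], "Paris": [0.0, 1.0]}
facts = [
    {"subject": "Globex", "predicate": "has office in", "object": "Berlin"},
    {"subject": "Acme", "predicate": "has office in", "object": "Shanghai"},
    {"subject": "Initech", "predicate": "was founded in", "object": "Paris"},
]

offices = ResultSet(facts, TOY_VECTORS.__getitem__).filter(predicate="has office in")
best = offices.rank(wrt="China", attribute="object").top(1)
```

Because every operator returns a new result set, rank only ever sees the rows that survived filter, which mirrors the join-like narrowing described for operation chains.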

Representation Generators (g) & Chains (c)
Representation elements have associated generators g, which are classifiers, extractors and linkers which operate over Data, KGs or VSs.
The generators are stored in libraries, typed according to their representation function and tagged with the model metadata (such as training corpus and evaluation score, architecture and hyperparameter configuration). Generators can be composed using generator chains. For example, generating a KG from textual data would typically employ the chain $g_{CR} \circ g_{EL} \circ g_{OIE}^{RDF\text{-}NL}$. As with the generators, chains can be named and persisted into libraries.
Generators can also be associated with vector representations, e.g. $g_{VS}^{W2V}$. The set of pre-defined generators currently present in DBee is described in Table 1. Figure 1 summarises the main primitives of the system, depicting a schematic high-level component diagram of DBee.
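A generator chain can be read as ordinary function composition. The sketch below assumes the chain notation is applied right to left, as in mathematical composition, and uses deliberately trivial stand-ins for the real extractors; the names compose, g_oie, g_el and g_cr, and the dict-based document format, are hypothetical and not part of DBee.

```python
from functools import reduce


def compose(*generators):
    """Compose right to left: compose(f, g)(x) == f(g(x))."""
    return lambda data: reduce(lambda acc, g: g(acc), reversed(generators), data)


# Toy stand-ins for real extractors; each takes and returns a plain dict.
def g_oie(doc):
    """Pretend open information extraction: one triple per 3-word sentence."""
    sentences = doc["text"].split(". ")
    triples = [tuple(s.split()) for s in sentences if len(s.split()) == 3]
    return {**doc, "triples": triples}


def g_el(doc):
    """Pretend entity linking: map known surface forms to identifiers."""
    known = {"Acme": "ex:Acme"}
    links = {s: known[s] for s, _, _ in doc["triples"] if s in known}
    return {**doc, "links": links}


def g_cr(doc):
    """Pretend coreference resolution: resolve 'It' to the previous subject."""
    resolved, last = [], None
    for s, p, o in doc["triples"]:
        s = last if s == "It" and last else s
        resolved.append((s, p, o))
        last = s
    return {**doc, "triples": resolved}


chain = compose(g_cr, g_el, g_oie)  # applies g_oie first, then g_el, then g_cr
kg = chain({"text": "Acme builds rockets. It sells anvils"})
```

Since a chain is itself a callable with the same shape as a single generator, it can be named, stored and reused as one step of a larger chain.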
Concretely, we propose a pipeline with the following steps: First, using contextualised open information extraction, structured information is extracted from the unstructured text, in the form of a set of inter-linked subject-predicate-object triples, thus yielding a graph. With coreference resolution, the graph is further enriched semantically, linking nodes that refer to the same entity in the text. In a final entity linking step, recognised entities are connected to existing resources, contextualising them further regarding existing background knowledge. The extracted knowledge graph is then serialised and indexed, while still retaining its logical graph representation. In particular, we use full-text search capable databases and nearest neighbour indices to enable querying and approximation of stored data using string-based as well as embedding-based methods.
Table 1 (fragment recovered from the source): pre-defined generators.
(symbol not recovered): Entity Linker to the resource X
$g_{NER}$: Named Entity Recognizer
$g_{OIE}$: Open Information Extractor
$g_{\pi}$: provider of an embedding function π
The API layer features a chainable IDSL to allow intuitive interaction with the data. Concretely it is designed to support expressive recurring query patterns while reducing impedance mismatch.

Implementation
DBee was conceptualised and implemented as an extensible Python library. We use the HOCON format to enable easy generator chain definition and persistence. We provide a pre-defined chain featuring the contextualised open information extraction tool Graphene, the Stanford CoreNLP coreference resolution system (Manning et al., 2014) and the entity linker Spotlight (Mendes et al., 2011), which links recognised entities to DBpedia resources. Furthermore, we use ElasticSearch as the full-text search engine and Annoy to index the embeddings for the kNN queries.
Note that, following the design goals of extensibility and scalability, the software is not constrained to use those specific tools. Even the choice of generators and storage types is not fixed, as it requires little effort to add a new generator or storage type, for instance a relational database to perform joins more efficiently.

DBee in Code
Listed below are example instantiations illustrating the usage of the framework exercised on four exemplar use cases.

Extraction
Code 1 shows the boilerplate code required to instantiate DBee. From a list of Wikipedia article titles one can query the Wikipedia API and apply (2) (3) highlights the definition of a generator as one step of the chain, using HOCON syntax, with semantics similar to Python's logging module configuration.

Index Creation
From the KG extracted in the previous step, the storage indices can be populated. The DSL-level user does not necessarily need to know the actual name of the data view ('fact' in this case) but can obtain it by querying the class of the corresponding generator type (i.e. OIEGenerator.provides). Similarly, additional indices can be constructed and stored from the representation generated by an already defined chain or indices, as shown in the example, which utilises one of the pre-trained embedding generators provided by DBee.

    kb('nasdaq-100').using(docs.get_iterator('fact')).create_view('fact')
    kb('nasdaq-100').view('spo').create_projection('po', pi=IndraEmbedder, features=['predicate', 'object'])

Code Snippet 2: Example of data storage

Querying
Code 3 shows the query equivalent to the natural language query "Which companies have offices in China?". The query describes the process of filtering the list of initial entities to retain only those of the type "company", switching the data view to facts (performing a join implicitly), further filtering the facts, and finally projecting the remaining entities into the previously created vector space and ranking them by distance to the computed projection of a given term. An implicit join back to the textual view is made to retrieve the subjects of the re-ordered remaining facts. Note that the query uses already resolved filter predicates for brevity. One could likewise use the operation view('types').query('companies') to query for the concrete type URI using the expression obtained from the text, given that the view was constructed beforehand.

Code Snippet 4: Dataset Creation Example
Finally, the example in Code 4 shows the creation of a toy dataset for link type prediction between two interlinked facts. The user-defined bow and onehot functions serve as feature extractors.
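As an illustration of what such feature extractors might look like, the sketch below builds a tiny link type prediction dataset in plain Python. The bodies of bow and onehot, their signatures, the example facts and the link labels are assumptions made for this sketch; the paper only names the two functions.

```python
def bow(text, vocabulary):
    """Bag-of-words count vector over a fixed vocabulary."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]


def onehot(label, labels):
    """One-hot encoding of a label over a fixed list of labels."""
    return [1 if label == candidate else 0 for candidate in labels]


# Hypothetical interlinked facts: (source fact, link type, target fact).
linked_facts = [
    ("Acme opened an office", "CAUSE", "sales grew"),
    ("sales grew", "CONTRAST", "profits fell"),
]

vocabulary = sorted(
    {w for s, _, t in linked_facts for w in (s + " " + t).lower().split()}
)
link_types = sorted({link for _, link, _ in linked_facts})

# Input: concatenated bag-of-words of both facts; target: one-hot link type.
inputs = [bow(s, vocabulary) + bow(t, vocabulary) for s, _, t in linked_facts]
targets = [onehot(link, link_types) for _, link, _ in linked_facts]
```

The resulting inputs and targets are plain nested lists, so they can be converted to the tensor type of whichever automatic differentiation framework is used downstream.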

Analysis of T F in
While this is not meant to be a thorough empirical analysis, the following section gives insights into the expected performance and hardware requirements of the system. All of the following measurements were carried out on a notebook featuring an SSD, a dual-core i5-6200U CPU running at 2.3 GHz and 16 GB of RAM.
We provide the vector space indexing times and sizes for a varying number of indexed vectors, averaged over ten runs. In particular, we index $10^n$, $n \in \{0, \dots, 6\}$, embedding vectors using our vector space storage implementation based on Annoy. The dimension of the embeddings is 300. For full-text search indices, we performed the same procedure. We populated ElasticSearch indices with $10^n$, $n \in \{0, \dots, 6\}$, facts, denormalised according to the data model, using a single local node. Figures 6 and 7 show the results, revealing that indexing times and index sizes scale linearly with the size of the dataset. Axes within the plots are at logarithmic scale.
The creation of a small dataset from 100 Wikipedia articles yielded 2292 recognised entities, 22633 distinct subject-predicate-object structures and 39456 contextual links. The total required storage space was 8.159 MB for the denormalised textual data stored in ElasticSearch and 148.37 MB for the stored 300-dimensional word2vec embeddings. Running all steps in sequence, from obtaining the documents up to storing them in the corresponding indices, took approximately 3.5 hours.
It is worth noting that the open information extraction step takes up most of the processing time. However, since by design the extraction process does not need to explore any dependencies between different documents, a speedup factor of up to n can be assumed for n parallel instantiations of the extraction pipeline.
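Because documents are processed independently, the extraction can be distributed with a simple worker pool. The following is a minimal sketch using Python's standard library; the extract function is a hypothetical placeholder for the per-document chain, and the use of threads assumes that the heavy lifting happens in external services rather than in Python itself (for CPU-bound pure-Python work, a process pool would be the analogous choice).

```python
from concurrent.futures import ThreadPoolExecutor


def extract(document):
    """Placeholder for the per-document extraction chain (hypothetical)."""
    return {"title": document, "n_tokens": len(document.split())}


def extract_all(documents, workers=4):
    """Run the extraction independently per document; keeps input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, documents))


results = extract_all(["first toy document", "second one"], workers=2)
```

Under the idealised assumption that the whole 3.5 hour run parallelises perfectly, four workers would bring it down to about 3.5 / 4 = 0.875 hours, roughly 53 minutes; in practice this is a lower bound, since document retrieval and index population are not necessarily parallel.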

Current Limitations & Future Work
In its current version, the software does not support data insertion or updates. This is a consequence of choosing Annoy, in favour of its speed, as the implementation for nearest neighbour approximation. There are, however, recent approaches for nearest neighbour estimation that support dynamic index updates (Li and Malik, 2017).
Furthermore, the current approach requires different tools to store different data representation types such as views and projections. One future direction is to investigate how to build low-level vector space index support into existing DBMS.
Finally, a rigorous analysis regarding scalability, complexity, performance and usability will be carried out in the future.

Conclusion
In this paper, we formalised an approach to create and manage knowledge graphs and embeddings and to query them jointly, and introduced DBee, a system implementing this approach. We hope to provide the community with a tool that facilitates the management, storage and querying of latent and explicit text representations and their integration into downstream ML/AI applications.