An Introduction to Knowledge Graphs

A comprehensive guide to Knowledge Graphs and their applications

35 min read

In an era defined by data, making sense of complex relationships and vast information repositories has become critical for organizations. Knowledge graphs are emerging as a powerful tool to tackle these challenges. By organizing data into a network of interconnected entities and relationships, knowledge graphs provide a structured way to represent information and extract insights.

In traditional data systems, information is often stored in siloed databases, making it difficult to extract meaningful insights. For instance:

  • Disconnected Data: Systems that store data independently struggle to provide a unified view.
  • Poor Relationship Representation: Conventional databases often fall short in modeling complex interconnections between entities.
  • Search Limitations: Searching across structured and unstructured data seamlessly is challenging.

In his presentation, Mike Bergman describes the nature of the world as messy, complicated, interconnected, diverse and ever-changing. As a result, our knowledge of this world is never complete, exists in structured, semi-structured and unstructured formats, and can be found everywhere. This knowledge is contextual and must be coherent. This is the world we live in and the world we are trying to model with knowledge graphs.

A knowledge graph addresses these problems by integrating data from various sources into a unified graph structure. It represents entities (nodes) that connect to each other through relationships (edges) in a format that is both human-readable and machine-interpretable. This framework acts as a database, enabling complex queries by understanding the context and connections between various pieces of information. Knowledge graphs enhance AI applications by improving information retrieval and reasoning capabilities across multiple data sources, powering tasks like semantic search, recommendation systems, and more.

What is a Knowledge Graph (KG)?

The term knowledge graph has been used frequently in research and business, usually in close association with Semantic Web technologies, linked data, large-scale data analytics and cloud computing. The term "knowledge graph" is often mistakenly thought to have originated in 2012, when Google adopted it to describe its structured entity-attribute information, prominently featured on its search results pages. While Google's use of the term has significantly boosted its visibility and marketing appeal, the concept dates back much further. The phrase "knowledge graph" itself can be traced to the 1970s, and the underlying ideas go back even earlier.

A knowledge graph, also known as a semantic network, represents a network of real-world entities—such as objects, events, situations or concepts—and illustrates the relationship between them. This information is usually stored in a graph database and visualized as a graph structure, prompting the term knowledge "graph" 1.

There have been many efforts to clearly define the term, but to put it simply: a knowledge graph is a network-based representation of knowledge that organizes data from multiple sources and captures information about entities of interest and the relationships between them. Knowledge graphs are:

  • Graphs: unlike knowledge bases, the content of KGs is organised as a graph, where the nodes (entities of interest and their types), the relationships between them, and their attributes are equally important. This makes it easy to integrate new datasets and formats, and supports exploration by navigating from one part of the graph to another through links.
  • Semantic: the meaning of the data is encoded for programmatic use in an ontology, which describes the types of entities in the graph and their characteristics, and can be represented as a schema sub-graph. This means the graph is both a place to organise and store data, and a place to reason about it and derive new information.

Knowledge graphs consist of:

  • Nodes: Representing entities like people, places, things, or abstract concepts
  • Edges: Connections between nodes showing relationships
  • Labels: Attributes that define the relationships and reasoning rules

At its core, a knowledge graph is a data structure that connects data in a semantic way, allowing both humans and machines to understand the context and meaning of the information. Live examples include the Google Knowledge Graph, which powers search results with contextual insights, and LinkedIn's Economic Graph, which models professional connections and job market trends.

  • "A knowledge graph (i) mainly describes real-world entities and their interrelations, organized in a graph, (ii) defines possible classes and relations of entities in a schema, (iii) allows for potentially interrelating arbitrary entities with each other and (iv) covers various topical domains." (Paulheim 1)
  • "Knowledge graphs are large networks of entities, their semantic types, properties, and relationships between entities." (Journal of Web Semantics 2)
  • "Knowledge graphs could be envisaged as a network of all kinds of things which are relevant to a specific domain or to an organization. They are not limited to abstract concepts and relations but can also contain instances of things like documents and datasets." (Semantic Web Company 3)
  • "We define a Knowledge Graph as an RDF graph. An RDF graph consists of a set of triples where each triple $(s, p, o)$ is an ordered set of the following terms: a subject $s \in U \cup B$, a predicate $p \in U$, and an object $o \in U \cup B \cup L$. An RDF term is either a URI $u \in U$, a blank node $b \in B$, or a literal $l \in L$." (Färber et al. 4)
  • "[...] systems exist, [...], which use a variety of techniques to extract new knowledge, in the form of facts, from the web. These facts are interrelated, and hence, recently this extracted knowledge has been referred to as a knowledge graph." (Pujara et al. 5)

Table 1: Selected definitions of knowledge graph - Towards a Definition of Knowledge Graphs 6

Different Types of Information Management Systems

To appreciate the uniqueness of knowledge graphs, it’s helpful to understand how they compare to other types of information management systems:

  • Relational databases: store data in a row-based table structure which connects related data elements. An RDBMS includes functions that maintain the security, accuracy, integrity and consistency of the data.
  • Data warehouses: de-normalised data stores that allow for analytical activities like counts, aggregations, etc.
  • Key-value stores: a data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash table.
  • Column-oriented databases: store data tables by column rather than by row. Benefits include more efficient access to data when only querying a subset of columns (by eliminating the need to read columns that are not relevant), and more options for data compression.
  • Graph databases: use graph structures for semantic queries, with nodes, edges, and properties to represent and store data.
  • Document stores: data storage systems designed for storing, retrieving and managing document-oriented information, also known as semi-structured data.
  • Time-series databases: optimized for handling time series data, i.e., data points indexed in time order.
  • Multi-model databases: support multiple data models against a single, integrated backend.

Under the hood

The key difference between a graph and relational database is that relational databases work with sets while graph databases work with paths. This manifests itself in unexpected and unhelpful ways for a Relational Database Management System (RDBMS) user.

For example, when trying to emulate path operations (e.g. friends of friends) by recursively joining tables in a relational database, query latency grows unpredictably and massively, as does memory usage, not to mention that it tortures SQL to express those kinds of operations. More data means slower queries in a set-based database, even if you can delay the pain through judicious indexing.
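
To make the contrast concrete, here is a minimal, hypothetical Python sketch (the data and names are invented): the relational style recomputes "friends of friends" by joining a table of pairs with itself, scanning all rows per hop, while the graph style simply follows adjacency lists outward from the start node, touching only its neighbourhood.

```python
# Hypothetical illustration: "friends of friends" in a set/join style vs. a traversal style.

# Relational-style data: a table of (person, friend) rows.
friend_rows = [
    ("alice", "bob"), ("bob", "carol"), ("bob", "dave"), ("carol", "erin"),
]

def friends_of_friends_join(rows, person):
    """Emulate a self-join: pair up rows where the first row's friend equals the second row's person."""
    return {
        (a, c)
        for (a, b1) in rows          # first copy of the table
        for (b2, c) in rows          # second copy of the table
        if a == person and b1 == b2 and c != a
    }

# Graph-style data: adjacency lists, i.e. relationships stored next to each node.
adjacency = {
    "alice": ["bob"],
    "bob": ["carol", "dave"],
    "carol": ["erin"],
    "dave": [],
    "erin": [],
}

def friends_of_friends_traversal(adj, person):
    """Follow edges two hops out; cost depends on the neighbourhood size, not the total data size."""
    return {fof for friend in adj.get(person, []) for fof in adj.get(friend, []) if fof != person}

print(friends_of_friends_join(friend_rows, "alice"))     # {('alice', 'carol'), ('alice', 'dave')}
print(friends_of_friends_traversal(adjacency, "alice"))  # {'carol', 'dave'}
```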

Most graph databases don't suffer this kind of join pain because they express relationships at a fundamental level. That is, relationships physically exist on disk and they are named, directed, and can be themselves decorated with properties (the property graph model). This means if you chose to, you could look at the relationships on disk and see how they "join" entities. Relationships are therefore first-class entities in a graph database and are semantically far stronger than those implied relationships reified at runtime in a relational store.

tldr;

  1. Graph databases are much faster than relational databases for connected data - a strength of the underlying model. A consequence of this is that query latency in a graph database is proportional to how much of the graph you choose to explore in a query, and is not proportional to the amount of data stored, thus defusing the join bomb.
  2. Graph databases make modelling and querying much more pleasant, meaning faster development.

How to determine if Knowledge Graphs are what you need?

1. Is your Data Highly-Connected?

Graph solutions are focused on highly-connected data that comes with an intrinsic need for relationship analysis. If the connections within the data are not the primary focus and the data is of a transactional nature, then a graph database is probably not the best fit.

2. Is Retrieving the Data more Important than Storing it?

Graph databases are optimized for data retrieval and you should go with the graph database if you intend to retrieve data often. If your focus is on writing to the database and you’re not concerned with analyzing the data, then a graph database wouldn’t be an appropriate solution. A good rule of thumb is, if you don’t intend to use JOIN operations in your queries, then a graph is not a must-have.

3. Does your Data Model Change Often?

If your data model is inconsistent and demands frequent changes, then using a graph database might be the way to go. Because graph databases are more about the data itself than the schema structure, they allow a degree of flexibility.

On the other hand, there are often benefits in having a predefined and consistent table that’s easy to understand. Developers are comfortable with and used to relational databases, and that fact cannot be downplayed.

For example, if you are storing personal information such as names, dates of birth, locations… and don’t expect many new fields or a change in data types, relational databases are the go-to solution. On the other hand, a graph database could be useful if:

  • Additional attributes could be added at some point,
  • Not all entities will have all the attributes in the table and
  • The attribute types are not strictly defined.

Graphs as data structures

A Graph is a non-linear data structure consisting of vertices and edges. The vertices are sometimes also referred to as nodes and the edges are lines or arcs that connect any two nodes in the graph. More formally, a knowledge graph as a directed labeled graph is a 4-tuple $G = (N, E, L, f)$, where $N$ is a set of nodes, $E \subseteq N \times N$ is a set of edges, $L$ is a set of labels, and $f: E \to L$ is an assignment function from edges to labels. An assignment of a label $B$ to an edge $E = (A, C)$ can be viewed as a triple $(A, B, C)$.
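
As a quick, hypothetical illustration of the definition above (the entities are made up), the following Python snippet encodes a small directed labeled graph as the sets N, E, L and the assignment function f, and shows how each labeled edge can be read as a triple.

```python
# A tiny directed labeled graph G = (N, E, L, f).
N = {"Ada", "London", "UK"}                          # nodes
E = {("Ada", "London"), ("London", "UK")}            # edges, a subset of N x N
L = {"bornIn", "locatedIn"}                          # labels
f = {                                                # assignment function f: E -> L
    ("Ada", "London"): "bornIn",
    ("London", "UK"): "locatedIn",
}

# Each labeled edge (A, C) with label B can be viewed as the triple (A, B, C).
triples = {(a, f[(a, c)], c) for (a, c) in E}
print(triples)  # {('Ada', 'bornIn', 'London'), ('London', 'locatedIn', 'UK')}
```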

Types of Graphs

  • Null Graph: a graph with no edges.
  • Trivial Graph: a graph with only a single vertex; it is the smallest possible graph.
  • Undirected Graph: a graph in which edges have no direction, i.e., each edge is an unordered pair of nodes.
  • Directed Graph: a graph in which each edge has a direction, i.e., each edge is an ordered pair of nodes.
  • Labeled Graph: a graph whose edges are labelled (relationships can carry properties).
  • Connected Graph: a graph in which every node can be reached from any other node.
  • Disconnected Graph: a graph in which at least one node is not reachable from some other node.
  • Regular Graph: a graph in which every vertex has the same degree K, called a K-regular graph.
  • Complete Graph: a graph in which there is an edge between every pair of nodes.
  • Cycle Graph: a graph that forms a single cycle; every vertex has degree 2.
  • Cyclic Graph: a graph containing at least one cycle.
  • Directed Acyclic Graph: a directed graph that does not contain any cycle.
  • Bipartite Graph: a graph whose vertices can be divided into two sets such that no edge connects two vertices within the same set.
  • Weighted Graph: a graph in which each edge is assigned a weight. Weighted graphs can be further classified as directed weighted graphs and undirected weighted graphs.

You can also mix these types, e.g., directed cyclic graphs, directed labelled cyclic graphs, directed labelled cyclic multigraphs, etc.

🌲 Trees are restricted types of graphs, with a few extra rules. Every tree is a graph, but not every graph is a tree. Linked lists, trees, and heaps are all special cases of graphs.
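
If you want to check such properties programmatically, a graph library can help; the sketch below assumes the Python networkx package as the tooling and tests a few of the graph types listed above.

```python
import networkx as nx

# Directed graph with labeled edges (edge attributes play the role of labels).
dg = nx.DiGraph()
dg.add_edge("Ada", "London", label="bornIn")
dg.add_edge("London", "UK", label="locatedIn")
print(nx.is_directed_acyclic_graph(dg))        # True: no cycles, so this is a DAG

# Undirected cycle graph on 4 nodes: every vertex has degree 2.
cycle = nx.cycle_graph(4)
print(nx.is_connected(cycle))                  # True: connected graph
print(all(d == 2 for _, d in cycle.degree()))  # True: 2-regular

# A complete bipartite graph is, by construction, bipartite.
print(nx.is_bipartite(nx.complete_bipartite_graph(2, 3)))  # True
```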

Graphs as data models

Directed edge-labelled graphs

A directed edge-labelled graph (or multi-relational graph 7 8 9) is a set of nodes connected by directed, labelled edges. In knowledge graphs, nodes represent entities, and edges represent relationships between them.

Key Features

  • Flexible Data Representation: Graphs allow for integrating new data sources more flexibly than relational databases, which require predefined schemas. Unlike hierarchical data models (e.g., XML, JSON), graphs allow cycles and avoid rigid hierarchical structuring.
  • Bidirectional Edges: for clarity, a bidirectional relationship can be represented as two directed edges.
  • Incomplete Data: Missing information can simply be omitted, such as when the graph lacks start/end dates for an event.

A standardised data model based on directed edge-labelled graphs is the Resource Description Framework (RDF), which has been recommended by the W3C for representing knowledge graphs on the web. The RDF model defines different types of nodes, including IRIs, which allow for global identification of entities on the Web; literals, which allow for representing strings (with or without language tags) and other datatype values (integers, dates, etc.); and blank nodes, which are anonymous nodes that are not assigned an identifier.

Everything in an RDF graph is called a resource. “Edge” and “Node” are just the roles played by a resource in a given statement. Fundamentally, in RDF there is no difference between resources playing an edge role and resources playing a node role. An edge in one statement can be a node in another. We will give examples of this in the diagrams that follow that will make this core idea clearer.

There is a standard query language for RDF graphs called SPARQL. It is both a full-featured query language and an HTTP protocol, making it possible to send query requests to endpoints over HTTP. A key part of the standard is the definition of RDF serializations. The most commonly used serialization format is called Turtle. There is also a JSON serialization called JSON-LD as well as an XML serialization (RDF/XML). All RDF databases are able to export and import graph content in standard serializations, making it easy and seamless to interchange data.

A directed edge-labelled graph is a tuple $G = (V, E, L)$, where $V \subseteq \text{Con}$ is a set of nodes, $L \subseteq \text{Con}$ is a set of edge labels, and $E \subseteq V \times L \times V$ is a set of edges.
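
As a small, hypothetical illustration of the RDF model, its serializations, and SPARQL (using the Python rdflib library as an assumed tool choice and made-up example IRIs), the sketch below builds a three-triple graph, serializes it to Turtle, and answers a SPARQL query.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")  # hypothetical namespace for the example

g = Graph()
g.add((EX.Ada, RDF.type, EX.Person))                             # IRI nodes
g.add((EX.Ada, RDFS.label, Literal("Ada Lovelace", lang="en")))  # language-tagged literal
g.add((EX.Ada, EX.bornIn, EX.London))                            # edge between two IRIs

print(g.serialize(format="turtle"))  # Turtle serialization of the graph

# SPARQL: list every property and value attached to Ada.
query = """
SELECT ?p ?o WHERE { <http://example.org/Ada> ?p ?o . }
"""
for p, o in g.query(query):
    print(p, o)
```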

Heterogeneous Graphs

A heterogeneous graph 10 11 12 (or heterogeneous information network 13 14) is a directed graph where each node and edge is assigned one type. Heterogeneous graphs are similar to directed edge-labelled graphs, with edge labels corresponding to edge types, but they also include node types as part of the graph model.

An edge is called homogeneous if it connects two nodes of the same type and heterogeneous if it connects nodes of different types. Heterogeneous graphs allow partitioning nodes by their type, which is useful for machine learning tasks 10 11 12.

In contrast, directed edge-labelled graphs support a more flexible model where nodes can have zero or multiple types.

A heterogeneous graph is a tuple $G = (V, E, L, l)$, where $V \subseteq \text{Con}$ is a set of nodes, $L \subseteq \text{Con}$ is a set of edge and node labels, $E \subseteq V \times L \times V$ is a set of edges, and $l : V \rightarrow L$ maps each node to a label.
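
A minimal sketch of the heterogeneous-graph definition above, with a node-typing function l and a check of whether each edge is homogeneous or heterogeneous (the entities are invented for illustration):

```python
# Heterogeneous graph G = (V, E, L, l): every node is assigned exactly one type via l.
V = {"Ada", "London", "UK"}
L = {"Person", "City", "Country", "bornIn", "locatedIn"}  # shared label set for node and edge types
l = {"Ada": "Person", "London": "City", "UK": "Country"}  # node-typing function l: V -> L
E = {("Ada", "bornIn", "London"), ("London", "locatedIn", "UK")}

for s, p, o in E:
    kind = "homogeneous" if l[s] == l[o] else "heterogeneous"
    print(f"{s} -[{p}]-> {o}: {kind} edge")
```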

Property Graphs

While there are core commonalities in property graph implementations, there is no true standard property graph data model. Each implementation of a Property Graph is, therefore, somewhat different. The following discusses the characteristics that are common for any property graph database.

Generally, the property graph data model consists of three elements:

  • Nodes: The entities in the graph. Nodes can be tagged with zero to many text labels representing their type. Nodes are also called vertices.
  • Edges: The directed links between nodes. Edges are also called relationships. The “from node” of a relationship is called the source node. The “to node” is called the target node. Each edge has a type. While edges are directed, they can be navigated and queried in either direction.
  • Properties: The key-value pairs associated with a node or with an edge.

Property values can have data types. Supported data types depend on the vendor. For example, Neo4j data types are similar, but not identical, to Java language data types.

A property graph is a tuple $G = (V, E, L, P, U, e, l, p)$, where: $V \subseteq \text{Con}$ is a set of node IDs, $E \subseteq \text{Con}$ is a set of edge IDs, $L \subseteq \text{Con}$ is a set of labels, $P \subseteq \text{Con}$ is a set of properties, $U \subseteq \text{Con}$ is a set of values, $e : E \rightarrow V \times V$ maps an edge ID to a pair of node IDs, $l : V \cup E \rightarrow 2^{L}$ maps a node or edge ID to a set of labels, and $p : V \cup E \rightarrow 2^{P \times U}$ maps a node or edge ID to a set of property–value pairs.
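
To ground the definition above, here is a small, hypothetical Python sketch of a property graph: nodes and edges have internal IDs, zero or more labels, and key-value properties, mirroring the functions e, l, and p.

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """A node or an edge: a set of labels plus key-value properties (the l and p mappings)."""
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)

# Node IDs mapped to their labels and properties.
nodes = {
    "n1": Element({"Person"}, {"name": "Ada", "born": 1815}),
    "n2": Element({"City"}, {"name": "London"}),
}

# e: edge ID -> (source node ID, target node ID); edge data carries labels and properties.
edges = {"e1": ("n1", "n2")}
edge_data = {"e1": Element({"BORN_IN"}, {"certainty": 0.9})}

src, dst = edges["e1"]
print(f"{nodes[src].properties['name']} -[{edge_data['e1'].labels}]-> {nodes[dst].properties['name']}")
```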

A key part of any data model is having a query language available for working with it. After all, users need to have a way to access and manipulate the data in the graph. No industry-standard query language exists for property graphs. Instead, each database offers its own unique query language that is incompatible with the others:

  • Neo4j offers Cypher, also known as CQL—its own query language that, to some extent, took SQL as an inspiration.
  • TigerGraph offers GSQL—its own query language that also took SQL as an inspiration.
  • MS SQL Graph has their own extension to SQL to support graph query.
  • Some vendors, in addition to their own query language, also implement some subset of Cypher. For example, SAP Hana offers its own extensions to SQL and its own GraphScript language plus they support a subset of Cypher.

There is also Apache TinkerPop, an open-source graph computing framework that is integrated with a number of property graph databases. It offers the Gremlin language, which is more of an API language than a query language.

A key requirement for working with any data model is the ability to reference nodes, properties and relationships (edges). In the case of property graphs, internally, nodes and edges have IDs. IDs are assigned by a database and are internal to a database. Referencing is done by using text strings—node labels, relationship types, and property names.

RDF vs. Property Graph

  • Expressivity: RDF allows arbitrarily complex descriptions via links to other nodes, though it has no properties on edges out of the box; with RDF* the model becomes much more expressive than property graphs. Property graphs have limited expressivity: beyond the basic directed cyclic labeled graph, they add properties (key-value pairs) on nodes and edges.
  • Formal semantics: RDF ✅ (standard schema and model semantics foster reuse and inference); property graphs ❌ (no formal model representation).
  • Standardisation: RDF is driven by W3C working groups and standardisation processes; property graphs come from different competing vendors.
  • Query language: RDF has SPARQL, a W3C standard; property graphs have Cypher, PGQL, G-CORE, GQL → no single standard.
  • Serialisation format: RDF ✅ (multiple serialisation formats); property graphs ❌ (no standard serialisation format).
  • Schema language: RDF ✅ (RDFS, OWL, SHACL shapes); property graphs ❌ (none).
  • Design goal: RDF targets Linked Data (publishing and linking data with formal semantics and no central control); property graphs target graph representation for analytics.
  • Processing strengths: RDF suits set analysis operations (as in SQL but with schema abstraction and flexibility); property graphs suit graph traversal (plenty of graph analytics and ML libraries).

tldr; The main advantages of RDF

  • The RDF data model provides a richer, semantically consistent foundation than property graphs.

  • Text values can also have language tags to support internationalisation of data. For example, instead of a single value of rdfs:label for New York City we could have multiple language-tagged values such as:

    “New York City”@en

    “Nueva York”@es

  • A key differentiator is that the underlying model (schema) is represented in the same way as the data. Just to serve as a primer, rdf:type is a predicate used to connect a resource with a class it belongs to, and rdfs:label is used to provide a display name for a resource. The uniformity of the data model makes RDF graphs more easily evolvable and gives them more flexibility compared to property graphs.

  • Enrichment Through Composition: with the inherent composability of RDF graphs, when two nodes have the same URI they are automatically merged. This means that you can load different files and their content will be joined together, forming a larger and more interesting graph (see the sketch after this list).

  • Having data in a standard format allows for easy integration with the wealth of Open Data available, e.g., DBpedia, GeoNames, OpenCorporates, etc.

  • No vendor lock-in: it is all open source and based on W3C standards.
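
To illustrate the Enrichment Through Composition point above, here is a minimal sketch (again assuming the Python rdflib library and made-up example IRIs) that parses two independent Turtle snippets into one graph; because both snippets use the same IRI for ex:Ada, their statements merge around a single node.

```python
from rdflib import Graph

doc_a = """
@prefix ex: <http://example.org/> .
ex:Ada a ex:Person ;
       ex:bornIn ex:London .
"""

doc_b = """
@prefix ex: <http://example.org/> .
ex:Ada ex:wrote ex:Note_G .
ex:London ex:locatedIn ex:UK .
"""

g = Graph()
g.parse(data=doc_a, format="turtle")  # load the first dataset
g.parse(data=doc_b, format="turtle")  # load the second; shared IRIs merge automatically

for s, p, o in g:                     # statements from both files now live in one graph
    print(s, p, o)
print(len(g))                         # 4 triples in the combined graph
```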

So, in the end, what is a Knowledge Graph?

A Knowledge Graph is a connected data structure of data and associated metadata applied to model, integrate and access information assets. The knowledge graph represents real-world entities, facts, concepts, and events as well as the relationships between them. Knowledge graphs yield a more accurate and comprehensive representation of data.

Knowledge Graphs (KGs) have emerged as a compelling abstraction for organising the world’s structured knowledge, and as a way to integrate information extracted from multiple data sources. Knowledge graphs have started to play a central role in representing the information extracted using natural language processing and computer vision. Domain knowledge expressed in KGs is being input into machine learning models to produce better predictions.

The heart of the knowledge graph is a knowledge model – a collection of interlinked descriptions of concepts, entities, relationships and events where:

  • Descriptions have formal semantics that allow both people and computers to process them in an efficient and unambiguous manner;
  • Descriptions contribute to one another, forming a network, where each entity represents part of the description of the entities related to it;
  • Diverse data is connected and described by semantic metadata according to the knowledge model.

Knowledge graphs combine characteristics of several data management paradigms:

  • Database, because the data can be explored via structured queries;
  • Graph, because they can be analysed as any other network data structure;
  • Knowledge base, because they bear formal semantics, which can be used to interpret the data and infer new facts.

💬 “By 2025, graph technologies will be used in 80% of data and analytics innovations, up from 10% in 2021, facilitating rapid decision making across the enterprise” 2

Knowledge graphs have a number of benefits over conventional relational databases and document stores. Specifically:

  • A unified, single source of truth
  • Flexible and highly adaptable data structure
  • Can represent knowledge in any domain
  • Wide range of tooling for data model definition and control
  • Ability to link and enrich data
  • Huge open source library of linked data
  • The perfect playground for virtually all ML tasks

Knowledge graphs, represented in RDF, provide the best framework for data integration, unification, linking and reuse, because they combine:

  • Expressivity: The standards in the Semantic Web stack – RDF, RDFS and OWL – allow for a fluent representation of various types of data and content: data schema, taxonomies and vocabularies, all sorts of metadata, reference and master data. The RDF* extension makes it easy to model provenance and other structured metadata.
  • Performance: All the specifications have been thought out, and proven in practice, to allow for efficient management of graphs of billions of facts and properties.
  • Interoperability: There is a range of specifications for data serialization, access (the SPARQL Protocol for endpoints), management (the SPARQL Graph Store protocol) and federation. The use of globally unique identifiers facilitates data integration and publishing.
  • Standardization: All the above is standardized through the W3C community process, to make sure that the requirements of different actors are satisfied – all the way from logicians to enterprise data management professionals and system operations teams.

What is NOT a Knowledge Graph?

Not every graph is a knowledge graph. For instance, a set of statistical data, e.g. the GDP data for countries, represented in RDF is not a KG. A graph representation of data is often useful, but it might be unnecessary to capture the semantic knowledge of the data. It might be sufficient for an application to just have a string ‘Italy’ associated with the string ‘GDP’ and a number ‘1.95 trillion’ without needing to define what countries are or what the ‘Gross Domestic Product’ of a country is. It’s the connections and the graph that make the KG, not the language used to represent the data.

Not every knowledge base is a knowledge graph. A key feature of a KG is that entity descriptions should be interlinked to one another. The definition of one entity includes another entity. This linking is how the graph forms. (e.g. A is B. B is C. C has D. A has D). Knowledge bases without formal structure and semantics, e.g. Q&A “knowledge base” about a software product, also do not represent a KG. It is possible to have an expert system that has a collection of data organized in a format that is not a graph but uses automated deductive processes such as a set of ‘if-then’ rules to facilitate analysis.

Why Knowledge Graphs are very exciting for ML?

Bringing knowledge graphs and machine learning together will systematically improve the accuracy of the systems and extend the range of machine learning capabilities. We are particularly interested in their applications in:

Data Insufficiency

Having a sufficient amount of data to train a machine learning model is very important. In the case of sparse data, a knowledge graph can be used to augment the training data, e.g., by replacing an entity name in the original training data with the name of another entity of a similar type. This way, a large number of both positive and negative examples can be generated from the knowledge graph.
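
As a rough, hypothetical sketch of this idea (all names, types, and sentences below are invented): given a small type index that could be derived from a knowledge graph, extra training examples are generated by swapping an entity for another entity of the same type.

```python
import random

# Tiny type index that could be derived from a knowledge graph.
entities_by_type = {
    "City": ["Paris", "Berlin", "Madrid"],
    "Person": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
}

def augment(sentence: str, entity: str, entity_type: str, n: int = 2, seed: int = 0):
    """Create n new examples by replacing the entity with other entities of the same type."""
    rng = random.Random(seed)
    candidates = [e for e in entities_by_type[entity_type] if e != entity]
    return [sentence.replace(entity, rng.choice(candidates)) for _ in range(n)]

print(augment("Ada Lovelace was born in London.", "Ada Lovelace", "Person"))
```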

Zero-Shot Learning

Today, a key challenge for a machine learning model is that, without proper training data, it cannot distinguish between two data points. In machine learning, this is known as the zero-shot learning problem. This is where knowledge graphs can play a very big role: the induction from the machine learning model can be complemented with deduction from the knowledge graph, e.g., for types of situations that did not appear in the training data.

Explainability

One of the major problems in the machine learning industry is explaining the predictions made by machine learning systems. One issue is that the representations driving those predictions are implicit. Knowledge graphs can alleviate this problem by mapping explanations to nodes in the graph and summarizing the decision-making process.

Appendix

Graph DBMS vendors

For each vendor product, the list below gives: native or multimodel architecture, supported models, deployment platforms, query languages, supported graph algorithms/libraries, license model, and pricing model (nodes, users, consumption).

  • Amazon Neptune: native; models: property, RDF; deployment: cloud; query languages: TinkerPop Gremlin, SPARQL; algorithms/libraries: TinkerPop; license: open source, managed service; pricing: on-demand instances, storage, I/O, backups, data transfer.
  • Cambridge Semantics AnzoGraph DB: native; models: RDF (and property graph); deployment: on-premises, multicloud, hybrid; query languages: OpenCypher, SPARQL; algorithms/libraries: built-in; license: freemium, subscription; pricing: vCPU cores.
  • DataStax Enterprise and Astra: multimodel; models: property; deployment: on-premises, multicloud; query languages: TinkerPop Gremlin, GraphQL; algorithms/libraries: TinkerPop; license: open core; pricing: nodes and consumption.
  • Dgraph: native; models: GraphQL, JSON; deployment: on-premises, cloud; query languages: DQL, GraphQL; algorithms/libraries: built-in; license: open source, Apache 2.0; pricing: CPUs per node.
  • Franz AllegroGraph: multimodel; models: document, graph (JSON-LD), RDF; deployment: on-premises, multicloud, hybrid; query languages: SPARQL, SPARQL*, FedShard-parallel SPARQL, GraphQL, Prolog/Datalog, Lisp, JIG/Gremlin, domain-specific languages; algorithms/libraries: built-in; license: closed source; pricing: CPU cores.
  • MarkLogic: multimodel; models: triples; deployment: on-premises, multicloud, hybrid; query languages: JavaScript, Optic, Search, SPARQL, XQuery; algorithms/libraries: built-in; license: closed source (free developer version); pricing: cores, consumption (free developer version).
  • Microsoft Azure Cosmos DB: multimodel; models: property; deployment: cloud; query languages: Gremlin; algorithms/libraries: built-in; license: open source, Apache 2.0; pricing: throughput capacity, serverless consumption.
  • Neo4j: native; models: property; deployment: on-premises, multicloud, hybrid; query languages: Cypher (openCypher), GraphQL, RDF/SPARQL, SQL; algorithms/libraries: built-in; license: open core, managed service; pricing: SaaS: RAM, consumption; on-premises: machines/cores/RAM.
  • Ontotext GraphDB: native; models: RDF, OWL/RDFS; deployment: on-premises, multicloud; query languages: GraphQL, SPARQL, SPARQL*, SQL; algorithms/libraries: built-in; license: perpetual, subscription, limited free version; pricing: per CPU.
  • OpenLink Virtuoso: multimodel; models: relational, RDF; deployment: on-premises, multicloud, hybrid; query languages: SPARQL, SQL; algorithms/libraries: built-in; license: closed source; pricing: concurrent users and CPU affinity, per node.
  • Oracle: multimodel; models: property, RDF; deployment: on-premises, multicloud, hybrid; query languages: PGQL, SPARQL; algorithms/libraries: built-in; license: perpetual, subscription; pricing: perpetual: user/server/enterprise; subscription: consumption.
  • Redis Labs RedisGraph: multimodel; models: property; deployment: on-premises, multicloud, hybrid; query languages: Cypher; algorithms/libraries: LAGraph; license: perpetual, subscription; pricing: consumption (RGUs based on memory and throughput).
  • SAP HANA: multimodel; models: property; query languages: GraphScript (proprietary), openCypher, SQL, SQLScript; algorithms/libraries: built-in; license: perpetual, subscription; pricing: perpetual: users; subscription: consumption.
  • Stardog: native; models: RDF; deployment: on-premises, multicloud, hybrid; query languages: GraphQL, SPARQL, SQL; algorithms/libraries: built-in; license: subscription; pricing: nodes, consumption.
  • TIBCO Graph Database: native; models: property; deployment: on-premises, multicloud, hybrid; query languages: Gremlin; algorithms/libraries: built-in; license: freemium, perpetual, subscription; pricing: on-premises: cores, connection; subscription: consumption.
  • TigerGraph: native; models: labeled property; deployment: on-premises, multicloud; query languages: GSQL; algorithms/libraries: built-in; license: freemium, perpetual, subscription; pricing: on-premises: available RAM for data storage; cloud: vCPU, RAM, disk size, I/O.

Vendor Profiles

Amazon Web Services (AWS)

Amazon Web Services, based in Seattle, introduced Amazon Neptune in 2018 as a cloud-only managed service and claims thousands of active customers. Neptune is a native graph DBMS supporting property graphs and the W3C's RDF model. ACID transactions, in-memory execution, up to 15 read replicas and high availability are part of the service. Instances are priced by the hour and billed per second with no long-term commitments, with added charges for storage, IOs, backup, and data transfer in and out. The service supports graphs of up to 64TB, encryption-at-rest with customer-managed keys and cross-region snapshot sharing.

Neptune can be queried with W3C's SPARQL as well as Apache TinkerPop Gremlin to build graph applications and implement custom graph algorithms. Open-source tools are available on GitHub under Apache 2.0 and MIT licenses. AWS is an active contributor to the Apache TinkerPop open-source project.

The graph-notebook project, an open-source (Apache 2.0) Jupyter Notebook developed by AWS, allows customers to visualize the results of queries run against the graph database and to get started with sample graph applications. Neptune ML supports making predictions over graph data using graph neural networks (GNNs) from the Deep Graph Library (DGL).

Cambridge Semantics

Cambridge Semantics is based in Boston and has offered Anzo, a knowledge graph platform that includes the AnzoGraph DB engine, since 2015, with freemium and enterprise software subscriptions on-premises and via cloud marketplaces in multicloud and hybrid deployments. Pricing for both the platform and the engine is based on the number of vCPU cores used by the graph engine.

AnzoGraph DB version 2.2 is a native graph engine supporting RDF as well as labeled property graphs for property graph use cases. Inferencing is performed for RDFS and OWL 2 RL ontologies using in-memory materialization of triples. It utilizes an MPP OLAP engine serving use cases where calculations and analytics need to be performed across the whole of a knowledge graph. It also supports querying via SPARQL and OpenCypher. A library of analytics functions is provided with the product, along with integrations for a range of ML frameworks.

The Anzo Knowledge Graph Platform adds capabilities for knowledge graph management, metadata management, visual schema and query design tools, data ingestion, and integration with analytics and business intelligence (BI) tools.

DataStax

DataStax is based in Santa Clara, California, and added graph capabilities to DataStax Enterprise (DSE) in 2015. The capabilities are also available in DataStax’s cloud DBMS offering, Astra. A nonrelational multimodel DBMS, it supports property graphs using Apache TinkerPop Gremlin as a query language. With its history as the leading commercializer of Apache Cassandra, DataStax offers an open-core version of that DBMS, with added features in the enterprise version including graph support. Version 6.8 offers graph data models implemented natively within Cassandra.

DSE supports developers writing queries in Cassandra Query Language (CQL), Graph/Gremlin, and Spark SQL, and offers a collaborative notebook interface as well as visual exploration of DSE graphs without requiring Gremlin skills. DataStax Enterprise also offers support for Gremlin algorithms and analytics integrations. This enables the multimodel aspect of the product to support combinations of technologies across a variety of data collections for analytics use cases and ML. Given Cassandra’s broad adoption for operational use cases, this provides DataStax market differentiation.

Dgraph

Dgraph Labs is based in Palo Alto, California, and is a recent entrant to the graph DBMS market with the Dgraph product in 2016. The platform is entirely written in the Go language, with features aimed at the application development community, which is increasingly using GraphQL as an API layer on top of graph data structures, and at enterprise customers that use GraphQL as a layer to collect data from multiple back ends and present a unified service or API. There is a community edition available for download, together with hosted public and private cloud solutions on AWS and other clouds, with pricing based on CPU nodes.

Dgraph is a native graph DBMS that handles JSON data as well as triples. Querying the database, however, is done using GraphQL and DQL (Dgraph Query Language). The platform is designed for real-time transactional workloads, utilizing relationship-based sharding to optimize queries and traversals across distributed clusters. Its native support for graph algorithms includes recursive graph traversal and k-shortest-path algorithms, with further capabilities planned in upcoming releases. The product fits into the GraphQL ecosystem, including tools for querying and visualization.

Franz

Franz is based in Lafayette, California, and entered the graph database market in 2006 with AllegroGraph, a multimodel triplestore supporting documents and graphs that implements RDF*, SPARQL*, and related W3C standards. The platform can be deployed in on-premises and cloud environments, with pricing based on CPU cores.

An RDFS++ runtime reasoner allows usage of the RDFS modeling language and a subset of terms from the Web Ontology Language (OWL) at query time. OWL 2 RL support and inference are also available using static materialization.

A unique feature of AllegroGraph since its first release is the use of Prolog as a mechanism to extend or customize the model and reasoning capabilities. AllegroGraph can federate queries in parallel across multiple distributed triplestores using its proprietary FedShard technology. AllegroGraph’s Triple Attributes Security uniquely addresses high-security data environments through role-based, cell-level data access.

Graph traversal, graph analytics, graph algorithms, and ML are all included natively within the product or through extensions with third-party libraries, such as Python graph and ML libraries. Graph visualization and exploratory analysis are supported by Gruff, a no-code tool natively integrated into the platform.

MarkLogic

MarkLogic is based in San Carlos, California, and entered the graph DBMS market in 2013 with a document-based multimodel product with a focus on knowledge graph use cases. Unlike most other multimodel offerings, it also has a native triplestore, permitting it to optimally store data not already inside documents.

The MarkLogic Semantics capability is built into the core product and sold as a license option. It enables direct querying in SPARQL and reasoning support for RDFS and OWL ontologies. Custom rules can be defined using MarkLogic’s own rule language, which is based on the SPARQL CONSTRUCT operator.

MarkLogic can be deployed on-premises, in the cloud, or in hybrid deployments, and has APIs for SPARQL, JavaScript, XQuery, Search, SQL, REST, Java, Node.js, and Optic queries. It offers a free developer version and pricing by cores and/or consumption. It provides tools focused on data curation and access — queries can run against both data and metadata within the same query, making it well-suited to building data hubs. It also provides integrations for ML.

Microsoft

Microsoft, based in Redmond, Washington, provides graph capabilities in several offerings, including Azure Cosmos DB and SQL Server. Azure Cosmos DB is a nonrelational multimodel DBMS deployed as a managed service in the cloud. It provides a property graph model and supports Gremlin as a query language.

Visual design and query tools from the Gremlin ecosystem are recommended. Algorithms are available through Gremlin recipes. Microsoft provides a Spark connector to enable ML.

Neo4j

Neo4j was founded in 2007 and is based in San Mateo, California. It is a native graph store supporting the property graph model. An open-source version of the platform is available, as well as an enterprise version that can be deployed across on-premises and cloud environments. Its managed SaaS service is called Aura. Pricing for self-managed installations is based on the number of machines, cores, and RAM.

Neo4j is the creator of the Cypher query language, which has been adopted by other graph databases as openCypher. It also provides APIs and libraries for graph traversals. Connectors for BI tools, streaming, Spark, and ingestion from various databases and file formats are also available. Neo4j is a member of and an active participant in the development of the GQL standard.

Neo4j for Graph Data Science includes over 50 algorithms, graph embeddings, and support for supervised and unsupervised ML. Visualization, schema design, and data source connection tools are also available.

Ontotext

Ontotext is based in Bulgaria and provides GraphDB, a native triplestore supporting RDF* and SPARQL*. Reasoning and inference for RDFS, OWL Lite, and OWL 2 RL/QL are supported via materialization of triples at load time. Virtualization over RDBMS systems is supported, enabling queries against relational databases.

The Ontotext Platform adds capabilities for schema creation, text processing pipelines, and platform deployment via Kubernetes. Developers can use GraphQL interfaces to overcome the need for writing SPARQL queries.

OpenLink Software

OpenLink Software offers Virtuoso, a multimodel DBMS supporting both relational and RDF data. Virtuoso enables data virtualization and hosts the Linked Open Data (LOD) Cloud cache, a collection of datasets accessible as a knowledge graph on the web. The platform supports inference on subclass and subproperty constructs, enabling graph traversal.

Oracle

Oracle entered the graph DBMS market in 2009 with graph capabilities in its multimodel Oracle DBMS. It supports both property graphs and RDF. The Oracle DBMS includes a library of graph algorithms and supports ML with frameworks like pandas and NumPy.

Redis Labs

Redis Labs offers RedisGraph, a graph database module for Redis that supports the Cypher query language and integrates with tooling for ML applications.

SAP

SAP HANA supports property graphs with querying via GraphScript, openCypher, and SQL. It includes a library of graph algorithms and integrates with ML tooling.

Stardog

Stardog provides the Stardog Enterprise Knowledge Graph Platform, a triplestore supporting RDF* and SPARQL*. It supports in-database ML and virtual graphs to represent data without duplication.

TIBCO Software

TIBCO Software offers TIBCO Graph Database, supporting property graphs and querying with Gremlin. It integrates with other TIBCO products.

TigerGraph

TigerGraph supports labeled property graphs with its GSQL query language. It features prebuilt schemas and tools for ML and analytics.


References

  1. Heiko Paulheim (2016). Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web.
  2. Gerhard Goos and Juris Hartmanis and Jan Leeuwen and David Hutchison and Jeff Z. Pan and Huajun Chen and Hong Gee Kim and Juan-Zi Li and Zhe Wu and Ian Horrocks and Riichiro Mizoguchi and Zhaohui Wu (2016). The Semantic Web. Lecture Notes in Computer Science.
  3. Andreas Blumauer (2016). From Taxonomies over Ontologies to Knowledge Graphs.
  4. Michael Färber and Achim Rettinger (2015). A Statistical Comparison of Current Knowledge Bases. International Conference on Semantic Systems.
  5. Jay Pujara and Hui Miao and Lise Getoor and William W. Cohen (2013). Knowledge Graph Identification. International Workshop on the Semantic Web.
  6. Lisa Ehrlinger and Wolfram Wöß (2016). Towards a Definition of Knowledge Graphs. International Conference on Semantic Systems.
  7. Maximilian Nickel and Volker Tresp (2013). Tensor factorization for multi-relational learning. Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III.
  8. Antoine Bordes and Nicolas Usunier and Alberto Garcı́a-Durán and Jason Weston and Oksana Yakhnenko (2013). Translating Embeddings for Modeling Multi-relational Data. Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States.
  9. Ivana Balazevic and Carl Allen and Timothy M. Hospedales (2019). Multi-relational Poincaré Graph Embeddings. Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada.
  10. Rana Hussein and Dingqi Yang and Philippe Cudré-Mauroux (2018). Are Meta-Paths Necessary?: Revisiting Heterogeneous Graph Embeddings. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018.
  11. Xiao Wang and Houye Ji and Chuan Shi and Bai Wang and Yanfang Ye and Peng Cui and Philip S. Yu (2019). Heterogeneous Graph Attention Network. The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019.
  12. Luwei Yang and Zhibo Xiao and Wen Jiang and Yi Wei and Yi Hu and Hao Wang (2020). Dynamic Heterogeneous Graph Embedding Using Hierarchical Attentions. Advances in Information Retrieval - 42nd European Conference on IR Research, ECIR 2020, Lisbon, Portugal, April 14-17, 2020, Proceedings, Part II.
  13. Yizhou Sun and Jiawei Han and Xifeng Yan and Philip S. Yu and Tianyi Wu (2011). Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment.
  14. Yizhou Sun and Jiawei Han (2012). Mining Heterogeneous Information Networks: Principles and Methodologies. Morgan & Claypool.
The opinions and views expressed on this blog are solely my own and do not reflect the opinions, views, or positions of my employer or any affiliated organizations. All content provided on this blog is for informational purposes only.