An Introduction to Knowledge Graphs
A comprehensive guide to Knowledge Graphs and their applications
In an era defined by data, making sense of complex relationships and vast information repositories has become critical for organizations. Knowledge graphs are emerging as a powerful tool to tackle these challenges. By organizing data into a network of interconnected entities and relationships, knowledge graphs provide a structured way to represent information and extract insights.
In traditional data systems, information is often stored in siloed databases, making it difficult to extract meaningful insights. For instance:
- Disconnected Data: Systems that store data independently struggle to provide a unified view.
- Poor Relationship Representation: Conventional databases often fall short in modeling complex interconnections between entities.
- Search Limitations: Searching across structured and unstructured data seamlessly is challenging.
A knowledge graph addresses these problems by integrating data from various sources into a unified graph structure. It represents data as entities (nodes) that connect to each other through various relationships (edges) in a format that is both human-readable and machine-interpretable. This framework acts as a database, enabling complex queries by understanding the context and connections between various pieces of information. Knowledge graphs enhance AI applications by improving information retrieval and reasoning capabilities across multiple data sources, powering tasks like semantic search, recommendation systems, and more.
What is a Knowledge Graph (KG)?
The term knowledge graph has been used frequently in research and business, usually in close association with Semantic Web technologies, linked data, large-scale data analytics and cloud computing. The term "knowledge graph" is often mistakenly thought to have originated in 2012, when Google adopted it to describe its structured entity-attribute information, prominently featured on its search results pages. While Google's use of the term has significantly boosted its visibility and marketing appeal, the concept dates back much further. The phrase "knowledge graph" itself can be traced to the 1970s, and the underlying ideas go back even earlier.
There have been many efforts to define the term clearly, but to put it simply, a knowledge graph is a network-based representation of knowledge that organizes data from multiple sources and captures information about entities of interest and the relationships between them. Knowledge graphs are:
- Graphs: unlike knowledge bases, the content of KGs is organised as a graph, where nodes (entities of interest and their types), relationships between and attributes of the nodes are equally important. This makes it easy to integrate new datasets and formats and supports exploration by navigating from one part of the graph to the other through links.
- Semantic: the meaning of the data is encoded for programmatic use in an ontology, which describes the types of entities in the graph and their characteristics and can be represented as a schema sub-graph. This means that the graph is both a place to organise and store data, and to reason what it is about and derive new information.
A knowledge graph consists of:
- Nodes: Representing entities like people, places, things, or abstract concepts
- Edges: Connections between nodes showing relationships
- Labels: Attributes that define the relationships and reasoning rules
At its core, a knowledge graph is a data structure that connects data in a semantic way, allowing both humans and machines to understand the context and meaning of the information. Some live examples of knowledge graphs are the Google Knowledge Graph, which powers search results with contextual insights, and the LinkedIn Knowledge Graph, which models professional connections and job market trends.
Extended Intro on Knowledge Graphs from the
Although the term “knowledge graph” has appeared in academic literature since at least 1972 [1], its modern usage gained prominence following Google's 2012 announcement of the Google Knowledge Graph [2]. This was soon followed by similar announcements from other major companies, including Airbnb [1], Amazon [2], eBay [3], Facebook [3], IBM [2], LinkedIn [3], Microsoft [8], and Uber [3]. The increasing adoption of knowledge graphs in industry has sparked a surge of academic interest, resulting in a growing body of scientific literature on the topic. This includes books (e.g., [7] [9] [3] [6]), papers defining the concept (e.g., [3]), innovative methodologies (e.g., [12] [17] [9]), and surveys focusing on specific aspects of knowledge graphs (e.g., [12] [19]).
Central to these developments is the fundamental principle of representing data as graphs, often augmented with explicit mechanisms to encode knowledge [10]. This approach is commonly used in applications that require the integration, management, and extraction of value from large-scale, heterogeneous data sources [10]. Graph-based knowledge representation offers several advantages compared to relational databases or NoSQL alternatives. Graphs provide an intuitive and compact abstraction for modeling diverse domains, with edges naturally capturing potentially cyclical relationships inherent in areas such as social networks, biological systems, bibliographic citations, transport networks, and more [1]. They also enable schema flexibility, allowing data and its scope to evolve dynamically, which is particularly valuable for handling incomplete knowledge [1]. Unlike other NoSQL approaches, graph-specific query languages support not only traditional relational operations (e.g., joins, unions, projections) but also navigational queries to discover entities linked by paths of arbitrary lengths [2].
Additionally, standard knowledge representation frameworks—such as ontologies [10] [4] [15] and rule-based systems [12] [14]—can define and reason about the semantics of the nodes and edges in the graph. Scalable frameworks for graph analytics [17] [30] [28] enable tasks like computing centrality, clustering, and summarization to derive insights about the domain. Furthermore, specialized graph representations have been developed to facilitate the application of machine learning techniques, both directly and indirectly, on graph data [29] [31].
Definition | Source |
---|---|
"A knowledge graph (i) mainly describes real-world entities and their interrelations, organized in a graph, (ii) defines possible classes and relations of entities in a schema, (iii) allows for potentially interrelating arbitrary entities with each other and (iv) covers various topical domains." | Paulheim [22] |
"Knowledge graphs are large networks of entities, their semantic types, properties, and relationships between entities." | Journal of Web Semantics [9] |
"Knowledge graphs could be envisaged as a network of all kinds of things which are relevant to a specific domain or to an organization. They are not limited to abstract concepts and relations but can also contain instances of things like documents and datasets." | Semantic Web Company [4] |
"We define a Knowledge Graph as an RDF graph. An RDF graph consists of a set of RDF triples where each RDF triple (s, p, o) is an ordered set of the following RDF terms: a subject s ∈ U ∪ B, a predicate p ∈ U, and an object o ∈ U ∪ B ∪ L. An RDF term is either a URI u ∈ U, a blank node b ∈ B, or a literal l ∈ L." | Färber et al. [9] |
"[...] systems exist, [...], which use a variety of techniques to extract new knowledge, in the form of facts, from the web. These facts are interrelated, and hence, recently this extracted knowledge has been referred to as a knowledge graph." | Pujara et al. [28] |
Different Types of Information Management Systems
To appreciate the uniqueness of knowledge graphs, it’s helpful to understand how they compare to other information management systems:
- Relational database management system (RDBMS): stores data in a row-based table structure which connects related data elements. An RDBMS includes functions that maintain the security, accuracy, integrity and consistency of the data.
- Analytical data store: de-normalised data stores that allow for analytical activities like count, aggregation, etc.
- Key-value store: data storage paradigm designed for storing, retrieving, and managing associative arrays, a data structure more commonly known today as a dictionary or hash table.
- Column-oriented database: stores data tables by column rather than by row. Benefits include more efficient access to data when only querying a subset of columns (by eliminating the need to read columns that are not relevant), and more options for data compression.
- Graph database: uses graph structures for semantic queries with nodes, edges, and properties to represent and store data.
- Document store: data storage system designed for storing, retrieving and managing document-oriented information, also known as semi-structured data.
- Time series database: optimized for handling time series data, i.e., data points indexed in time order.
- Multi-model database: supports multiple data models against a single, integrated backend.
Under the hood
The key difference between a graph and relational database is that relational databases work with sets while graph databases work with paths. This manifests itself in unexpected and unhelpful ways for a Relational Database Management System (RDBMS) user.
For example, when trying to emulate path operations (e.g. friends of friends) by recursively joining in a relational database, query latency and memory usage grow unpredictably and massively, not to mention that it tortures SQL to express those kinds of operations. More data means slower queries in a set-based database, even if you can delay the pain through judicious indexing.
Most graph databases don't suffer this kind of join pain because they express relationships at a fundamental level. That is, relationships physically exist on disk and they are named, directed, and can be themselves decorated with properties (the property graph model). This means if you chose to, you could look at the relationships on disk and see how they "join" entities. Relationships are therefore first-class entities in a graph database and are semantically far stronger than those implied relationships reified at runtime in a relational store.
tldr;
- Graph databases are much faster than relational databases for connected data, which is a strength of the underlying model. A consequence of this is that query latency in a graph database is proportional to how much of the graph you choose to explore in a query, and is not proportional to the amount of data stored, thus defusing the join bomb.
- Graph databases make modelling and querying much more pleasant, which means faster development.
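To make the "paths, not sets" point concrete, here is a minimal Python sketch of a friends-of-friends lookup over an adjacency list. The `FRIENDS` data and the `within_hops` helper are illustrative assumptions and not any database's API; the sketch only shows that the work done depends on the part of the graph the traversal touches.

```python
from collections import deque

# Hypothetical social graph stored as adjacency lists: each person maps
# directly to the people they are connected to.
FRIENDS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave", "erin"],
    "dave": ["bob", "carol"],
    "erin": ["carol"],
    "frank": ["grace"],  # separate component, never visited from alice
    "grace": ["frank"],
}


def within_hops(graph, start, max_hops):
    """Breadth-first search returning {person: distance} within max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if distance[person] == max_hops:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, []):
            if friend not in distance:
                distance[friend] = distance[person] + 1
                queue.append(friend)
    del distance[start]
    return distance


# People exactly two hops away from alice.
friends_of_friends = sorted(
    name for name, hops in within_hops(FRIENDS, "alice", 2).items() if hops == 2
)
print(friends_of_friends)
```

The traversal never looks at `frank` or `grace`, however large that part of the data grows, because it only follows edges that start from nodes it has already reached.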
How to determine if Knowledge Graphs are what you need?
1. Is your Data Highly-Connected?
Graph solutions are focused on highly-connected data that comes with an intrinsic need for relationship analysis. If the connections within the data are not the primary focus and the data is of a transactional nature, then a graph database is probably not the best fit.
2. Is Retrieving the Data more Important than Storing it?
Graph databases are optimized for data retrieval and you should go with the graph database if you intend to retrieve data often. If your focus is on writing to the database and you’re not concerned with analyzing the data, then a graph database wouldn’t be an appropriate solution. A good rule of thumb is, if you don’t intend to use JOIN operations in your queries, then a graph is not a must-have.
3. Does your Data Model Change Often?
If your data model is inconsistent and demands frequent changes, then using a graph database might be the way to go. Because graph databases are more about the data itself than the schema structure, they allow a degree of flexibility.
On the other hand, there are often benefits in having a predefined and consistent table that’s easy to understand. Developers are comfortable and used to relational databases and that fact cannot be downplayed.
For example, if you are storing personal information such as names, dates of birth, locations… and don’t expect many new fields or a change in data types, relational databases are the go-to solution. On the other hand, a graph database could be useful if:
- Additional attributes could be added at some point,
- Not all entities will have all the attributes in the table and
- The attribute types are not strictly defined.
Graphs as data structures
A graph is a non-linear data structure consisting of vertices and edges. The vertices are sometimes also referred to as nodes, and the edges are lines or arcs that connect any two nodes in the graph. More formally, a knowledge graph viewed as a directed labeled graph is a 4-tuple $G = (N, E, L, f)$, where $N$ is a set of nodes, $E \subseteq N \times N$ is a set of edges, $L$ is a set of labels, and $f: E \to L$ is an assignment function from edges to labels. An assignment of a label $B$ to an edge $e = (A, C)$ can be viewed as a triple $(A, B, C)$.
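The formal definition can be written down almost literally in code. The following is a minimal Python sketch, assuming a hypothetical `DirectedLabeledGraph` class and made-up example data; it stores the set of nodes, the set of edges, the set of labels and the assignment function, and exposes the triple view of each labelled edge.

```python
from dataclasses import dataclass


@dataclass
class DirectedLabeledGraph:
    nodes: set        # the set of nodes
    edges: set        # the set of edges, each an ordered (source, target) pair
    labels: set       # the set of labels
    assignment: dict  # the assignment function: edge -> label

    def __post_init__(self):
        # Every edge must connect known nodes and carry a known label.
        for source, target in self.edges:
            if source not in self.nodes or target not in self.nodes:
                raise ValueError(f"edge {(source, target)} uses an unknown node")
        if set(self.assignment) != self.edges:
            raise ValueError("every edge needs exactly one label")
        if not set(self.assignment.values()) <= self.labels:
            raise ValueError("assignment uses an unknown label")

    def triples(self):
        """View each edge (A, C) with label B as the triple (A, B, C)."""
        return {(a, self.assignment[(a, c)], c) for (a, c) in self.edges}


graph = DirectedLabeledGraph(
    nodes={"Alice", "Bob", "Paris"},
    edges={("Alice", "Bob"), ("Alice", "Paris")},
    labels={"knows", "lives_in"},
    assignment={("Alice", "Bob"): "knows", ("Alice", "Paris"): "lives_in"},
)
print(sorted(graph.triples()))
```

Because the assignment is a function from edges to labels, this literal reading allows only one label per ordered pair of nodes. Practical knowledge graphs usually store the triples directly so that two nodes can be linked by several differently labelled edges.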
Types of Graphs
- Null Graph: A graph with no edges.
- Trivial Graph: A graph with only a single vertex; it is the smallest possible graph.
- Undirected Graph: A graph whose edges have no direction. That is, every edge is an unordered pair of nodes.
- Directed Graph: A graph whose edges have a direction. That is, every edge is an ordered pair of nodes.
- Labeled Graph: A graph whose edges are labelled (and can carry properties describing the relationships).
- Connected Graph: A graph in which any node can be reached from any other node.
- Disconnected Graph: A graph in which at least one node cannot be reached from some other node.
- Regular Graph: A graph in which every vertex has the same degree K is called a K-regular graph.
- Complete Graph: A graph in which every node has an edge to every other node.
- Cycle Graph: A graph that consists of a single cycle, so the degree of each vertex is 2.
- Cyclic Graph: A graph containing at least one cycle.
- Directed Acyclic Graph: A directed graph that does not contain any cycle.
- Bipartite Graph: A graph whose vertices can be divided into two sets such that no edge connects two vertices in the same set.
- Weighted Graph: A graph in which every edge is assigned a weight. Weighted graphs can be further classified as directed weighted graphs and undirected weighted graphs.
These types can be combined, e.g., Directed Cyclic Graphs, Directed Labeled Cyclic Graphs, Directed Labeled Cyclic Multigraphs, etc.
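Several of these categories can be checked mechanically. As one example, the sketch below tells a directed acyclic graph from a cyclic one using a depth-first search; the `has_directed_cycle` function and the two sample graphs are illustrative assumptions.

```python
def has_directed_cycle(adjacency):
    """Return True if the directed graph contains at least one cycle.

    Depth-first search with three states: a node that is reached again
    while it is still being explored (GREY) closes a cycle.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {node: WHITE for node in adjacency}

    def visit(node):
        colour[node] = GREY
        for neighbour in adjacency.get(node, ()):
            state = colour.get(neighbour, WHITE)
            if state == GREY:
                return True
            if state == WHITE and visit(neighbour):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in list(adjacency))


dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}

print(has_directed_cycle(dag))     # False
print(has_directed_cycle(cyclic))  # True
```

The recursive form keeps the sketch short; for very deep graphs an explicit stack would avoid Python's recursion limit.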
Graphs as data models
Directed edge-labelled graphs
A directed edge-labelled graph (or multi-relational graph [4, 6, 25]) is a set of nodes connected by directed, labelled edges. In knowledge graphs, nodes represent entities, and edges represent relationships between them.
Key Features
- Flexible Data Representation: Graphs allow for integrating new data sources more flexibly than relational databases, which require predefined schemas. Unlike hierarchical data models (e.g., XML, JSON), graphs allow cycles and avoid rigid hierarchical structuring.
- Bidirectional Edges: For clarity, bidirectional edges can represent two directed edges.
- Incomplete Data: Missing information can simply be omitted, such as when the graph lacks start/end dates for an event.
A standardised data model based on directed edge-labelled graphs is the Resource Description Framework (RDF), which has been recommended by the W3C for representing knowledge graphs on the web. The RDF model defines different types of nodes, including Internationalized Resource Identifiers (IRIs), which allow for global identification of entities on the Web; literals, which allow for representing strings (with or without language tags) and other datatype values (integers, dates, etc.); and blank nodes, which are anonymous nodes that are not assigned an identifier.
Everything in an RDF graph is called a resource. “Edge” and “Node” are just the roles played by a resource in a given statement. Fundamentally in RDF, there is no difference between resources playing an edge role and resources playing a node role. An edge in one statement can be a node in another.
There is a standard query language for RDF graphs called SPARQL. It is both a full-featured query language and an HTTP protocol, making it possible to send query requests to endpoints over HTTP. A key part of the RDF standard is the definition of serializations. The most commonly used serialization format is called Turtle. There is also a JSON serialization called JSON-LD as well as an XML serialization. All RDF databases are able to export and import graph content in standard serializations, making it easy and seamless to interchange data.
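The statement-based model and the idea of querying by pattern can be sketched in a few lines of plain Python. This is not SPARQL and not an RDF library: the `TRIPLES` set, the `ex:` identifiers and the `match` helper are assumptions made for illustration, with `None` playing the role of a query variable.

```python
# Each statement is a (subject, predicate, object) tuple. The ex: names are
# made-up identifiers standing in for full IRIs.
TRIPLES = {
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "rdfs:label", "Alice"),
    # ex:knows is an edge above and a node here: both are just resources.
    ("ex:knows", "rdf:type", "rdf:Property"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    )


# "Which resources are of type ex:Person?"
people = [s for (s, _, _) in match(TRIPLES, predicate="rdf:type", obj="ex:Person")]
print(people)

# The same resource that labels an edge can itself be described.
print(match(TRIPLES, subject="ex:knows"))
```

A real SPARQL engine joins many such patterns and binds named variables across them; the sketch covers only a single pattern.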
Heterogeneous Graphs
A heterogeneous graph [19, 40, 44] (or heterogeneous information network [39, 40]) is a directed graph where each node and edge is assigned one type. Heterogeneous graphs are similar to directed edge-labelled graphs, with edge labels corresponding to edge types, but they also include node types as part of the graph model.
An edge is called homogeneous if it connects two nodes of the same type and heterogeneous if it connects nodes of different types. Heterogeneous graphs allow partitioning nodes by their type, which is useful for machine learning tasks [19] [42] [46].
In contrast, directed edge-labelled graphs support a more flexible model where nodes can have zero or multiple types.
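A small sketch may help to see what node types add. The data and the helper names `edge_kind` and `partition_by_type` below are hypothetical; each node carries exactly one type, each edge carries one type, and an edge is classified by comparing the types of its two endpoints.

```python
# Exactly one type per node, as the heterogeneous graph model requires.
NODE_TYPES = {"alice": "Person", "bob": "Person", "acme": "Company"}

# Edges as (source, edge_type, target).
EDGES = [
    ("alice", "knows", "bob"),
    ("alice", "works_for", "acme"),
]


def edge_kind(edge, node_types):
    """Classify an edge by comparing the types of its endpoints."""
    source, _edge_type, target = edge
    same = node_types[source] == node_types[target]
    return "homogeneous" if same else "heterogeneous"


def partition_by_type(node_types):
    """Group node names by their type, e.g. to build per-type feature tables."""
    groups = {}
    for node, node_type in node_types.items():
        groups.setdefault(node_type, []).append(node)
    return groups


for edge in EDGES:
    print(edge, edge_kind(edge, NODE_TYPES))
print(partition_by_type(NODE_TYPES))
```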
Property Graphs
While there are core commonalities in property graph implementations, there is no true standard property graph data model. Each implementation of a Property Graph is, therefore, somewhat different. The following discusses the characteristics that are common for any property graph database.
Generally, the property graph data model consists of three elements:
- Nodes: The entities in the graph. Nodes can be tagged with zero to many text labels representing their type. Nodes are also called vertices.
- Edges: The directed links between nodes. Edges are also called relationships. The “from node” of a relationship is called the source node. The “to node” is called the target node. Each edge has a type. While edges are directed, they can be navigated and queried in either direction.
- Properties: The key-value pairs associated with a node or with an edge.
Property values can have data types. Supported data types depend on the vendor. For example, Neo4j data types are similar, but not identical, to Java language data types.
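The three elements can be sketched with two small Python classes. This is an illustration of the general model under assumed names (`Node`, `Edge`, `neighbours`), not the storage format or API of any particular property graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int                                    # internal, database-assigned ID
    labels: set = field(default_factory=set)        # zero to many type labels
    properties: dict = field(default_factory=dict)  # key-value pairs


@dataclass
class Edge:
    source: int     # ID of the "from" node
    target: int     # ID of the "to" node
    edge_type: str  # each edge has a type
    properties: dict = field(default_factory=dict)  # key-value pairs on the edge


alice = Node(1, {"Person"}, {"name": "Alice"})
acme = Node(2, {"Company"}, {"name": "Acme"})
edges = [Edge(alice.node_id, acme.node_id, "WORKS_FOR", {"since": 2019})]


def neighbours(node_id, edges, direction="out"):
    """Follow edges from a node; edges are directed but usable both ways."""
    if direction == "out":
        return [e.target for e in edges if e.source == node_id]
    return [e.source for e in edges if e.target == node_id]


print(neighbours(alice.node_id, edges))        # follow WORKS_FOR forwards
print(neighbours(acme.node_id, edges, "in"))   # follow it backwards
```

Note that the `since` property sits on the edge itself, which is the feature that most clearly separates property graphs from plain edge-labelled graphs.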
A key part of any data model is having a query language available for working with it. After all, users need to have a way to access and manipulate the data in the graph. No industry standard query language exists for property graphs. Instead, each database offers its own query language that is incompatible with the others:
- Neo4j offers Cypher, also known as CQL, its own query language that, to some extent, took SQL as an inspiration.
- TigerGraph offers GSQL, its own query language that also took SQL as an inspiration.
- MS SQL Graph has its own extension to SQL to support graph queries.
- Some vendors, in addition to their own query language, also implement some subset of Cypher. For example, SAP HANA offers its own extensions to SQL and its own GraphScript language, and it also supports a subset of Cypher.
There is also Apache TinkerPop, an open source graph computing framework that is integrated with some property graph and RDF graph databases. It offers the Gremlin language, which is more of an API language than a query language.
A key requirement for working with any data model is the ability to reference nodes, properties and relationships (edges). In the case of property graphs, internally, nodes and edges have IDs. IDs are assigned by a database and are internal to a database. Referencing is done by using text strings—node labels, relationship types, and property names.
RDF vs. Property Graph
Feature | RDF | Property Graph |
---|---|---|
Expressivity | Arbitrarily complex descriptions via links to other nodes; no properties on edges out of the box. With RDF*, the model becomes much more expressive than property graphs | Limited expressivity beyond the basic directed cyclic labeled graph; properties (key-value pairs) for nodes and edges |
Formal Semantics | ✅ Standard schema and model semantics foster reuse and inference | ❌ No formal model representation |
Standardisation | Driven by W3C working groups and standardisation processes | Different competing vendors |
Query Language | SPARQL (W3C standard) | Cypher, PGQL, G-CORE, GQL → no standard |
Serialisation Format | ✅ Multiple serialisation formats | ❌ No serialisation format |
Schema Language | ✅ RDFS, OWL, Shapes | ❌ None |
Design goal | Linked Data (publishing and linking data with formal semantic and no central control) | Graph representation for analytics |
Processing Strengths | Set analysis operations (as in SQL but with schema abstraction and flexibility) | Graph Traversal (plenty of graph analytics and ML Libs) |
tldr; The main advantages of RDF
The RDF data model provides a richer, semantically consistent foundation than property graphs.
Text values can also have language tags to support internationalisation of data. For example, instead of a single value for rdfs:label for New York City we could have multiple values such as:
“New York City”@en
“Nueva York”@es
A key differentiator is that the underlying model (schema) is represented in the same way as the data. Just to serve as a primer, rdf:type is a predicate used to connect a resource with a class it belongs to; rdfs:label is used to provide a display name for a resource. The uniformity of the data model makes RDF graphs more easily evolvable and gives them more flexibility compared to property graphs.
Enrichment Through Composition: With the inherent composability of RDF graphs, when two nodes have the same URI, they are automatically merged. This means that you can load different files and their content will be joined together, forming a larger and more interesting graph.
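Merging by identifier needs no special machinery when statements are kept as a set: the union of two sets of triples is the merged graph. The sketch below uses plain Python tuples and made-up `http://example.org/` identifiers as assumptions; the two sets stand in for the contents of two separately loaded files.

```python
NYC = "http://example.org/nyc"

# Statements loaded from a first, hypothetical file.
file_a = {
    (NYC, "rdf:type", "http://example.org/City"),
    (NYC, "rdfs:label", "New York City"),
}

# Statements loaded from a second, hypothetical file. It describes the same
# resource because it uses the same URI as the first file.
file_b = {
    (NYC, "rdfs:label", "New York City"),
    (NYC, "http://example.org/locatedIn", "http://example.org/NewYorkState"),
}

# Set union merges the graphs; the duplicate label statement collapses.
merged = file_a | file_b


def describe(graph, uri):
    """All (predicate, object) pairs known about one resource."""
    return sorted(((p, o) for (s, p, o) in graph if s == uri), key=lambda po: po[0])


for predicate, value in describe(merged, NYC):
    print(predicate, value)
```

No mapping table or join key had to be declared: the shared URI alone is what connects the two sources.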
Having data in a standard format allows for easy integration with the wealth of Open Data available, e.g., DBpedia, Geonames, Open Corporates, etc.
No vendor lock-in: it's all open source and W3C standards.
So, in the end, what is a Knowledge Graph?
A Knowledge Graph is a connected data structure of data and associated metadata applied to model, integrate and access information assets. The knowledge graph represents real-world entities, facts, concepts, and events as well as the relationships between them. Knowledge graphs yield a more accurate and comprehensive representation of data.
Knowledge Graphs (KGs) have emerged as a compelling abstraction for organising the world’s structured knowledge, and as a way to integrate information extracted from multiple data sources. Knowledge graphs have started to play a central role in representing the information extracted using natural language processing and computer vision. Domain knowledge expressed in KGs is being input into machine learning models to produce better predictions.
The heart of the knowledge graph is a knowledge model – a collection of interlinked descriptions of concepts, entities, relationships and events where:
- Descriptions have formal semantics that allow both people and computers to process them in an efficient and unambiguous manner;
- Descriptions contribute to one another, forming a network, where each entity represents part of the description of the entities related to it;
- Diverse data is connected and described by semantic metadata according to the knowledge model.
Knowledge graphs combine characteristics of several data management paradigms:
- Database, because the data can be explored via structured queries;
- Graph, because they can be analysed as any other network data structure;
- Knowledge base, because they bear formal semantics, which can be used to interpret the data and infer new facts.
Knowledge graphs have a number of benefits over conventional relational databases and document stores. Specifically:
- A unified, single source of truth
- Flexible and highly adaptable data structure
- Can represent knowledge in any domain
- Wide range of tooling for data model definition and control
- Ability to link and enrich data
- Huge open source library of linked data
- The perfect playground for virtually all ML tasks
Knowledge graphs, represented in RDF, provide the best framework for data integration, unification, linking and reuse, because they combine:
- Expressivity: The standards in the Semantic Web stack (RDF, RDFS and OWL) allow for a fluent representation of various types of data and content: data schema, taxonomies and vocabularies, all sorts of metadata, reference and master data. The RDF* extension makes it easy to model provenance and other structured metadata.
- Performance: All the specifications have been thought out, and proven in practice, to allow for efficient management of graphs of billions of facts and properties.
- Interoperability: There is a range of specifications for data serialization, access (SPARQL Protocol for end-points), management (SPARQL Graph Store) and federation. The use of globally unique identifiers facilitates data integration and publishing.
- Standardization: All the above is standardized through the W3C community process, to make sure that the requirements of different actors are satisfied – all the way from logicians to enterprise data management professionals and system operations teams.
What is NOT a Knowledge Graph?
Not every RDF graph is a knowledge graph. For instance, a set of statistical data, e.g. the GDP data for countries, represented in RDF is not a KG. A graph representation of data is often useful, but it might be unnecessary to capture the semantic knowledge of the data. It might be sufficient for an application to just have a string ‘Italy’ associated with the string ‘GDP’ and a number ‘1.95 trillion’ without needing to define what countries are or what the ‘Gross Domestic Product’ of a country is. It’s the connections and the graph that make the KG, not the language used to represent the data.
Not every knowledge base is a knowledge graph. A key feature of a KG is that entity descriptions should be interlinked to one another. The definition of one entity includes another entity. This linking is how the graph forms. (e.g. A is B. B is C. C has D. A has D). Knowledge bases without formal structure and semantics, e.g. Q&A “knowledge base” about a software product, also do not represent a KG. It is possible to have an expert system that has a collection of data organized in a format that is not a graph but uses automated deductive processes such as a set of ‘if-then’ rules to facilitate analysis.
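The chain "A is B. B is C. C has D. A has D" is a small instance of deriving new facts from interlinked descriptions. The following Python sketch reproduces it with two hand-written rules applied until nothing new appears; the rule set and the `infer` function are assumptions for illustration, far simpler than what an ontology reasoner does.

```python
def infer(facts):
    """Forward-chain two rules until no new fact can be derived.

    Rule 1: X is Y and Y is Z   =>  X is Z
    Rule 2: X is Y and Y has Z  =>  X has Z
    """
    known = set(facts)
    while True:
        derived = {
            (x, relation, z)
            for (x, first, y) in known
            if first == "is"
            for (y2, relation, z) in known
            if y2 == y and relation in ("is", "has")
        }
        new = derived - known
        if not new:
            return known
        known |= new


facts = {("A", "is", "B"), ("B", "is", "C"), ("C", "has", "D")}
closure = infer(facts)

print(("A", "has", "D") in closure)
for fact in sorted(closure - facts):
    print("derived:", fact)
```

The derived fact `("A", "has", "D")` was never stated; it follows only because the descriptions of A, B and C link to one another.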
Why are Knowledge Graphs exciting for ML?
Bringing knowledge graphs and machine learning together can improve the accuracy of systems and extend the range of machine learning capabilities. We are particularly interested in their applications in:
Data Insufficiency
Having a sufficient amount of data to train a machine learning model is very important. When data is sparse, a knowledge graph can be used to augment the training data, e.g., by replacing an entity name in the original training data with the name of another entity of a similar type. In this way, a large number of both positive and negative examples can be created using the knowledge graph.
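A minimal sketch of this kind of augmentation is shown below. The `ENTITY_TYPES` lookup stands in for type information that would come from a knowledge graph, and the `augment` helper and the sample sentence are made up for illustration.

```python
import random

# Hypothetical entity-to-type lookup, as a knowledge graph could provide it.
ENTITY_TYPES = {
    "Paris": "City",
    "Berlin": "City",
    "Rome": "City",
    "Alice": "Person",
    "Bob": "Person",
}


def augment(sentence, entity, entity_types, rng):
    """Create a new example by swapping `entity` for another of the same type."""
    candidates = sorted(
        name
        for name, entity_type in entity_types.items()
        if entity_type == entity_types[entity] and name != entity
    )
    return sentence.replace(entity, rng.choice(candidates))


rng = random.Random(0)  # seeded so that runs are repeatable
original = "Alice flew to Paris last week."
example = augment(original, "Paris", ENTITY_TYPES, rng)
print(example)
```

Swapping in an entity of a different type (a person where a city is expected) would be one simple way to produce negative examples from the same lookup.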
Zero-Shot Learning
Today, the main challenge with a machine learning model is that without properly prepared training data it cannot distinguish between two data points. In machine learning, this is considered a zero-shot learning problem. This is where knowledge graphs can play a very big role. The induction from the machine learning model can be complemented with a deduction from the knowledge graph, e.g., where the type of situation did not appear in the training data.
Explainability
One of the major problems in the machine learning industry is explaining the predictions made by machine learning systems. One issue is that the representations behind a model's predictions are implicit. A knowledge graph can alleviate this problem by mapping the explanations to relevant nodes in the graph and summarizing the decision-making process.
Appendix
Graph DBMS vendors
| Vendor Product | Native or Multimodel | Supported Models | Deployment Platforms | Query Language | Supported Graph Algorithms/Libraries | License Model | Pricing Model (Nodes, Users, Consumption) |
|---|---|---|---|---|---|---|---|
| | Native | Property, | Cloud | TinkerPop, Gremlin, | TinkerPop | Open source, managed service | On-demand instances, storage, I/O, backups, data transfer |
| | Native | , (property) | On-premises, multicloud, hybrid | OpenCypher, | Built-in | Freemium, subscription | vCPU cores |
| and | Multimodel | Property | On-premises, multicloud | TinkerPop, Gremlin, GraphQL | TinkerPop | Open core | Nodes and consumption |
| | Native | GraphQL, JSON, | On-premises, cloud | DQL, GraphQL | Built-in | Open source, Apache 2.0 | CPUs per node |
| | Multimodel | Document, graph (JSON-LD), , , | On-premises, multicloud, hybrid | SPARQL, SPARQL*, FedShard-Parallel, SPARQL, GraphQL, Prolog/Datalog, Lisp, JIG/Gremlin, domain-specific languages | Built-in | Closed source | CPU cores |
| | Multimodel | Triples | On-premises, multicloud, hybrid | JavaScript, Optic, Search, SPARQL, XQuery | Built-in | Closed source (free developer version) | Cores, consumption (free developer version) |
| | Multimodel | Property | Cloud | Gremlin | Built-in | Open source, Apache 2.0 | Throughput capacity, serverless consumption |
| | Native | Property | On-premises, multicloud, hybrid | Cypher (openCypher), GraphQL, RDF/SPARQL, SQL | Built-in | Open core, managed service | SaaS: RAM, consumption; on-premises: machines/cores/RAM |
| | Native | , OWL/ | On-premises, multicloud | GraphQL, SPARQL, SPARQL*, SQL | Built-in | Perpetual, subscription, limited free version | per-CPU |
| | Multimodel | | On-premises, multicloud, hybrid | SPARQL, SQL | Built-in | Closed source | Concurrent users and CPU affinity, per node |
| | Multimodel | Property, | On-premises, multicloud, hybrid | PGQL, | Built-in | Perpetual, subscription | Perpetual: user/server/enterprise; subscription: consumption |
| | Multimodel | Property | On-premises, multicloud, hybrid | Cypher | LAGraph | Perpetual, subscription | Consumption (RGUs based on memory and throughput) |
| | Multimodel | Property | | GraphScript (proprietary), openCypher, SQL, SQLScript | Built-in | Perpetual, subscription | Perpetual: users; subscription: consumption |
| | Native | , | On-premises, multicloud, hybrid | GraphQL, SPARQL, SQL | Built-in | Subscription | Nodes, consumption |
| | Native | Property | On-premises, multicloud, hybrid | Gremlin | Built-in | Freemium, perpetual, subscription | On-premises: cores, connection; subscription: consumption |
| | Native | Labeled property | On-premises, multicloud | GSQL | Built-in | Freemium, perpetual, subscription | On-premises: available RAM for data storage; cloud: vCPU, RAM, disk size, I/O |
Vendor Profiles
Amazon Web Services (AWS)
AWS, based in Seattle, introduced Amazon Neptune in 2018 as a cloud-only managed service and claims thousands of active customers. Neptune is a native graph DBMS supporting property graph and the W3C’s RDF models. ACID transactions, in-memory execution, up to 15 read replicas and high availability are part of the service. Instances are priced by the hour and billed per second with no long-term commitments, and with added charges for storage, IOs, backup, and data transfer in and out. The service supports graphs of up to 64TB, encryption-at-rest with customer-managed keys and cross-region snapshot sharing.
Neptune can be queried with W3C’s SPARQL as well as Apache TinkerPop Gremlin to build graph applications and implement custom graph algorithms. Open-source tools are available on GitHub under Apache 2.0 and MIT licenses. AWS is an active contributor to the open-source project.
The , an open-source Apache2 Jupyter Notebook developed by AWS, allows customers to visualize the results of queries run against the graph database and to get started with sample graph applications. supports ML to make predictions over graph data using graph neural networks (GNNs) from the .
Cambridge Semantics
Cambridge Semantics is based in Boston and has offered Anzo, a knowledge graph platform that includes the AnzoGraph engine, since 2015, with freemium and enterprise software subscriptions on-premises and, via cloud marketplaces, in multicloud and hybrid deployments. Pricing for both the platform and engine is based on the number of vCPU cores used by the graph engine.
AnzoGraph version 2.2 is a native graph engine supporting and for property graph use cases. Inferencing is performed for and OWL 2 RL ontologies using in-memory materialization of triples. It utilizes an MPP OLAP engine serving use cases where calculations and analytics need to be performed across an entire knowledge graph. It also supports querying via and . A library of analytics functions is provided with the product, and ML frameworks supported include , , , , and .
The Anzo Knowledge Graph Platform adds capabilities for knowledge graph management, metadata management, visual schema and query design tools, data ingestion, and integration with analytics and business intelligence (BI) tools.
DataStax
DataStax is based in Santa Clara, California, and added graph capabilities to DataStax Enterprise (DSE) in 2015. The capabilities are also available in DataStax’s cloud DBMS offering, Astra. A nonrelational multimodel DBMS, it supports property graphs using Apache TinkerPop, with Gremlin as the query language. With its history as the leading commercializer of Apache Cassandra, DataStax offers an open-core version of that DBMS, with added features in the enterprise version including graph support. Version 6.8 offers graph data models implemented natively within Cassandra.
DataStax Studio supports developers writing queries in Cassandra Query Language (CQL), Graph/Gremlin, and Spark SQL, and offers a collaborative notebook interface as well as visual exploration of DSE graphs without requiring Gremlin skills. DataStax Enterprise also offers support for both Gremlin algorithms and , which integrate with . This lets the multimodel product combine technologies across a variety of data collections for analytics use cases and ML. Given Cassandra’s broad adoption for operational use cases, this provides DataStax market differentiation.
Dgraph
Dgraph is based in Palo Alto, California, and entered the graph DBMS market with the Dgraph product in 2016. The platform is written entirely in Go, with features aimed at the application development community, which is increasingly using GraphQL as an API layer on top of graph data structures, and at enterprise customers that use GraphQL as a layer to collect data from multiple back ends and present a unified service or API. There is a community edition available for download, together with hosted public and private cloud solutions in AWS, , and with pricing based on CPU nodes.
Dgraph is a native graph DBMS that handles JSON data as well as triples. Querying the database, however, is done using the and . The platform is designed for real-time transactional workloads, using relationship-based sharding to optimize queries and traversals across distributed clusters. Its native support for graph algorithms includes recursive graph traversal and k-shortest path algorithms. Dgraph plans to support in upcoming releases. The product fits into the GraphQL ecosystem, including tools for querying and visualization.
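To make the k-shortest path idea concrete, here is a minimal Python sketch. It is not Dgraph's implementation or query syntax: the `k_shortest_paths` function, the adjacency-dictionary graph format, and the sample nodes are all assumptions made for illustration. It runs a best-first search over partial paths, which returns loop-free paths in order of increasing cost as long as edge weights are non-negative.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical graph format: node -> {neighbor: edge weight}
Graph = Dict[str, Dict[str, float]]


def k_shortest_paths(
    graph: Graph, source: str, target: str, k: int
) -> List[Tuple[float, List[str]]]:
    """Return up to k loop-free paths from source to target, cheapest first."""
    results: List[Tuple[float, List[str]]] = []
    # Heap of (cost so far, path so far): best-first search over partial paths.
    frontier: List[Tuple[float, List[str]]] = [(0.0, [source])]
    while frontier and len(results) < k:
        cost, path = heapq.heappop(frontier)
        node = path[-1]
        if node == target:
            results.append((cost, path))
            continue
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in path:  # keep paths simple (no revisited nodes)
                heapq.heappush(frontier, (cost + weight, path + [neighbor]))
    return results


# Illustrative data only.
graph: Graph = {
    "alice": {"bob": 1.0, "carol": 4.0},
    "bob": {"carol": 1.0, "dave": 5.0},
    "carol": {"dave": 1.0},
    "dave": {},
}
paths = k_shortest_paths(graph, "alice", "dave", k=2)
for cost, path in paths:
    print(cost, " -> ".join(path))
```

This exhaustive search is fine for small graphs. A production engine would more likely use a specialized algorithm and push the work down to the storage layer.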
Franz
Franz is based in Lafayette, California, and entered the graph database market in 2006 with AllegroGraph, a multimodel triplestore supporting documents and graphs that implements RDF*, , and . The platform can be deployed in on-premises and cloud environments with pricing based on CPU cores.
An RDFS++ runtime reasoner allows usage of the RDFS modeling language and a subset of terms from the Web Ontology Language (OWL) at query time. OWL2 RL support and inference are also available using static materialization.
A unique feature of AllegroGraph since its first release is the use of Prolog as a mechanism to extend or customize the model and reasoning capabilities. AllegroGraph can federate queries in parallel across multiple distributed triplestores using its proprietary FedShard technology. AllegroGraph’s Triple Attributes Security uniquely addresses high-security data environments through role-based, cell-level data access.
Graph traversal, graph analytics, graph algorithms, and ML are all included natively within the product or through extensions with third-party libraries, such as and Python libraries, providing an interface from to . Graph visualization and exploratory analysis are supported by Gruff, a no-code tool natively integrated into the platform.
MarkLogic
MarkLogic is based in San Carlos, California, and entered the graph DBMS market in 2013 as a document-based multimodel product with a focus on knowledge graph use cases. Unlike most other multimodel offerings, it also has a native triplestore, which lets it efficiently store data that is not already inside documents.
The MarkLogic Semantics capability is built into the core product and sold as a license option. It enables direct querying in SPARQL and reasoning support for RDFS and OWL ontologies. Custom rules can be defined using MarkLogic’s own language, which is based on the SPARQL CONSTRUCT operator.
MarkLogic can be deployed on-premises, in the cloud, or in hybrid deployments, and has APIs for SPARQL, JavaScript, XQuery, Search, SQL, REST, Java, Node.js, and Optic queries. It offers a free developer version and pricing by cores and/or consumption. It provides tools focused on data curation and access: a single query can span both data and metadata, making it well-suited to building data hubs. It also supports and for ML.
Microsoft
Microsoft, based in Redmond, Washington, provides graph capabilities in three offerings: Azure Cosmos DB, SQL Server, and Azure SQL Database. Azure Cosmos DB is a nonrelational multimodel DBMS deployed as a managed service in the cloud. It provides a property graph model and supports Gremlin as a query language.
Visual design and query tools from the Gremlin ecosystem, such as and , are recommended. Algorithms are available through Gremlin recipes. Microsoft provides a Spark connector to enable ML via the .
Neo4j
Neo4j was founded in 2007 and is based in San Mateo, California. It is a native graph store supporting the property graph model. An open-source version of the platform is available, as well as an enterprise version that can be deployed across on-premises and cloud environments. Its managed SaaS service is called Neo4j Aura. Pricing for self-managed installations is based on the number of machines, cores, and RAM.
Neo4j is the creator of the Cypher query language, which has been adopted by other graph databases as openCypher. It also supports for graph traversals. Connectors for BI tools, streaming, Spark, and ingestion from various databases and file formats are also available. Neo4j is a member and active participant in the development of the .
Neo4j for Graph Data Science includes over 50 algorithms, graph embeddings, and support for supervised and unsupervised ML. Visualization, schema design, and data source connection tools are also available.
Ontotext
Ontotext is based in Bulgaria and provides GraphDB, a native triplestore supporting RDF* and SPARQL*. Reasoning and inference for RDFS, OWL Lite, and OWL2 RL/QL are supported via materialization of triples at load time. Virtualization over RDBMSs is supported, enabling queries against relational databases.
The Ontotext Platform adds capabilities for schema creation, text processing pipelines, and platform deployment via Kubernetes. Developers can use GraphQL interfaces to avoid writing SPARQL queries.
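Materialization at load time means the store applies its inference rules once, when data arrives, and saves the derived triples alongside the asserted ones so that queries need no reasoning step. The Python sketch below illustrates the idea with two standard RDFS entailment rules (subclass transitivity and type propagation). It is a toy forward-chaining loop, not how any product discussed here implements reasoning, and the `materialize` function and the `ex:` identifiers are hypothetical.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"


def materialize(triples: Set[Triple]) -> Set[Triple]:
    """Apply two RDFS rules repeatedly until no new triples appear."""
    closure = set(triples)
    while True:
        new: Set[Triple] = set()
        subclass_pairs = [(s, o) for s, p, o in closure if p == SUBCLASS]
        # rdfs11: rdfs:subClassOf is transitive.
        for a, b in subclass_pairs:
            for c, d in subclass_pairs:
                if b == c:
                    new.add((a, SUBCLASS, d))
        # rdfs9: an instance of a subclass is an instance of the superclass.
        for s, p, o in closure:
            if p == TYPE:
                for sub, sup in subclass_pairs:
                    if o == sub:
                        new.add((s, TYPE, sup))
        if new <= closure:  # fixed point reached
            return closure
        closure |= new


# Illustrative data only; the ex: names are made up.
loaded: Set[Triple] = {
    ("ex:Neptune", TYPE, "ex:GraphDBMS"),
    ("ex:GraphDBMS", SUBCLASS, "ex:DBMS"),
    ("ex:DBMS", SUBCLASS, "ex:Software"),
}
inferred = materialize(loaded) - loaded
for triple in sorted(inferred):
    print(triple)
```

The trade-off is the usual one: materialization costs extra storage and load time in exchange for faster queries, whereas query-time reasoning derives the same facts on demand.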
OpenLink Software
OpenLink Software offers Virtuoso, a multimodel DBMS supporting both relational and RDF data. Virtuoso enables data virtualization and hosts the , a collection of datasets accessible as a knowledge graph on the web. The platform supports inference on subclass and subproperty constructs, enabling graph traversal.
Oracle
Oracle entered the graph DBMS market in 2009 with graph capabilities in its multimodel Oracle DBMS. It supports and . The Oracle DBMS includes a library of graph algorithms and supports ML with frameworks like pandas and NumPy.
Redis Labs
Redis Labs offers RedisGraph, a graph database module for Redis supporting the Cypher query language. It integrates with for ML applications.
SAP
SAP supports property graphs with query via GraphScript, openCypher, and SQL. It includes a library of graph algorithms and integrates with .
Stardog
Stardog provides the Stardog Enterprise Knowledge Graph Platform, a triplestore supporting RDF* and SPARQL*. It supports in-database ML and virtual graphs to represent data without duplication.
TIBCO Software
TIBCO Software offers TIBCO Graph Database, supporting property graphs and querying with Gremlin. It integrates with .
TigerGraph
TigerGraph supports labeled property graphs with its GSQL query language. It features prebuilt schemas and tools for ML and analytics.