Everything You Need to Know About RDF
A comprehensive guide to the Resource Description Framework
This post is part of a Semantic Web series:
- Introduction to Semantic Web
- An Introduction to Knowledge Graphs
- Everything You Need to Know About RDF
- How to Query RDF: A Comprehensive Guide to SPARQL
In today's data-driven world, efficiently managing and representing complex information is crucial. The Resource Description Framework (RDF) is a powerful tool designed to meet this need. RDF is a framework for describing and interchanging metadata and knowledge on the web. It enables the structured representation of information, facilitating better data interoperability and integration across diverse systems.
Whether you're a web developer, data scientist, or just someone curious about how data can be organized and understood by machines, this comprehensive guide will walk you through everything you need to know about RDF. From its foundational concepts and components to practical examples and applications, we will explore how RDF works, why it matters, and how you can leverage it to enhance your data projects. Get ready to dive into the world of RDF and discover how it can transform the way you handle information.
Natural Language vs. Machine-Readable Data
In our daily lives, we effortlessly communicate complex ideas through natural language. For instance, consider the simple fact:
I (Ahmad) own a blog (http://assaf.website/blog)
This statement is straightforward for humans to understand and express. However, representing this information in a machine-readable format such as XML, one of the most prevalent data representation languages, can take various forms:
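As a rough illustration, here are three XML shapes for that one fact, each with its own small Python reader. The element and attribute names are invented for this sketch and are not taken from any standard; the point is only that every shape needs different extraction code even though the fact is the same.

```python
import xml.etree.ElementTree as ET

# Three hypothetical XML encodings of "Ahmad owns http://assaf.website/blog".
AS_NESTED = "<person name='Ahmad'><owns>http://assaf.website/blog</owns></person>"
AS_ATTRIBUTES = "<owner person='Ahmad' blog='http://assaf.website/blog'/>"
AS_INVERTED = "<blog url='http://assaf.website/blog'><owner>Ahmad</owner></blog>"


def read_nested(xml_text):
    root = ET.fromstring(xml_text)
    return (root.get("name"), "owns", root.findtext("owns"))


def read_attributes(xml_text):
    root = ET.fromstring(xml_text)
    return (root.get("person"), "owns", root.get("blog"))


def read_inverted(xml_text):
    root = ET.fromstring(xml_text)
    return (root.findtext("owner"), "owns", root.get("url"))


# Each reader reduces its document to the same (subject, predicate, object) fact.
facts = {
    read_nested(AS_NESTED),
    read_attributes(AS_ATTRIBUTES),
    read_inverted(AS_INVERTED),
}
print(facts)
```

All three documents collapse to a single fact once they are reduced to a subject, a predicate, and an object.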
These XML representations, while effective, may not align with the intuitive way humans structure information, typically in a Subject - Verb/Predicate/Action - Object format. This realization led to the development of RDF, which adopts the same three-part structure to represent knowledge. The name itself reflects its purpose:
- Resource: Refers to anything that can be uniquely identified with a URI, such as web pages, locations, individuals, or products.
- Description: Encompasses the attributes, characteristics, and relationships of resources.
- Framework: Provides the models, languages, and syntaxes for describing resources.
RDF is an intuitive knowledge representation using directed graphs, where the subjects and objects are the nodes and the predicates are the edges of that graph. A statement composed of these three parts is called an RDF triple, where the subject is a URI or a blank (empty) node, the predicate is a URI, and the object can be a URI, a literal, or a blank node. If we transform this knowledge into the traditional relational model (tables), it looks like:
| Subject | Predicate | Object |
|---|---|---|
| Ahmad | has a blog | assaf.website/blog |
In plain English, an RDF statement states facts, relationships, and data by linking resources of different kinds. With the help of an RDF statement, just about anything can be expressed in a uniform structure consisting of three linked pieces of data.
Constituents of an RDF Triple
RDF uses a graph-based model to represent knowledge. The main components are:
- Subject: The resource being described (e.g., a person, a webpage).
- Predicate: The property or relationship (e.g., "has a blog").
- Object: The value or resource linked to the subject (e.g., a URL, a name).
Together, these components form a directed graph where:
- Nodes represent subjects and objects.
- Edges represent predicates.
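A minimal sketch of this model in plain Python, assuming nothing more than tuples and sets: each statement is a (subject, predicate, object) tuple and a graph is a set of such tuples. The second statement ("has name") is a hypothetical addition for illustration.

```python
# A graph is a set of (subject, predicate, object) tuples.
graph = {
    ("Ahmad", "has a blog", "http://assaf.website/blog"),
    ("Ahmad", "has name", "Ahmad Assaf"),  # hypothetical extra statement
}


def nodes(graph):
    """Subjects and objects are the nodes of the directed graph."""
    return {term for s, _, o in graph for term in (s, o)}


def edges(graph):
    """Predicates label the edges, which point from subject to object."""
    return {p for _, p, _ in graph}


def describe(graph, subject):
    """All (predicate, object) pairs stated about one subject."""
    return sorted((p, o) for s, p, o in graph if s == subject)


print(nodes(graph))
print(edges(graph))
print(describe(graph, "Ahmad"))
```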
An RDF graph is a collection of triples. A node in a graph can be one of the following kinds:
URIs
A unique URI (Uniform Resource Identifier) is assigned to any resource or thing that needs to be described. A URI can be a URL or some other kind of unique identifier. Unlike URLs, URIs do not necessarily enable access to the resource they describe; that is, in most cases they do not represent actual web pages.
URI schemes are versatile and not limited to web addresses; they can also identify a wide range of entities, including telephone numbers, ISBN numbers, and geographic coordinates. In essence, a URI is a way to uniquely identify a resource, and it can be used as either the subject or the object of a statement. Once a URI is used as a subject, it can be regarded as a resource, allowing us to make additional statements about it.
Using URIs to label entities and their interconnections is significant because it establishes a global, unambiguous naming system. Adopting such a system mitigates the problem of homonyms that has historically hindered the representation of distributed data.
A URI in an RDF graph is a Unicode string that does not contain any control characters and that would produce a valid URI character sequence representing an absolute URI with an optional fragment identifier when subjected to an encoding that consists of:
- encoding the Unicode string as UTF-8, giving a sequence of octet values.
- %-escaping octets that do not correspond to permitted characters.
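The two encoding steps above can be sketched with Python's standard `urllib.parse.quote`, which encodes to UTF-8 and then %-escapes every octet outside a "safe" set. The example URI and the exact safe set chosen here are assumptions made for illustration.

```python
from urllib.parse import quote

# Characters allowed through unescaped: URI reserved characters plus "%".
# quote() always leaves ASCII letters, digits and "-._~" untouched.
SAFE = ":/?#[]@!$&'()*+,;=%"


def to_uri_characters(unicode_uri):
    """Encode as UTF-8, then %-escape octets that are not permitted characters."""
    return quote(unicode_uri, safe=SAFE, encoding="utf-8")


print(to_uri_characters("http://example.org/café#menu"))
# http://example.org/caf%C3%A9#menu
```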
Literals
Literals are simple strings that describe data values that do not have a separate existence. They can be plain (a string combined with an optional language tag) or typed (a string combined with a datatype URI). Typed literals are expressed via the XML Schema datatypes. Whenever we use URIs to describe things in RDF, we try as much as we can to reuse existing namespaces, and for literals we use the XML Schema namespace defined at http://www.w3.org/2001/XMLSchema. So, for example, if I want to define a literal as a string, I use the following syntax:
We should note that literals are wrapped in quotation marks. The hash after the XMLSchema URI introduces the fragment identifier that points to string. In addition, I can specify a language tag that describes the "natural" language of the text. For example, "Semantic"@en means that this literal is an English word.
Literals are written using double quotes when they do not contain line breaks, like "simple literal", or using triple double quotes, like """long literal""", when they may contain line breaks.
Datatypes are a bit tricky. Let's think of the datatype for floating-point numbers. At an abstract level, the floating-point numbers themselves are different from the text we use to represent them on paper. For instance, the text "5.1" represents the number 5.1, but so do "5.1000" and "05.10". Here there are multiple textual representations, called lexical representations, for the same value.
NOTE
RDF/XML uses the namespace mechanism of XML, but in an expanded way. In XML, namespaces are only used for disambiguation purposes. In RDF/XML, external namespaces are expected to be documents defining resources, which are then used in the importing document. This mechanism allows the reuse of resources by other people who may decide to insert additional features into these resources. The result is the emergence of large, distributed collections of knowledge.
A datatype tells us how to map lexical representations to values, and vice versa. RDF reuses the XML Schema (W3C) datatypes, including xsd:string, xsd:float, xsd:double, xsd:integer, and xsd:date. RDF can also contain custom datatypes that (you guessed it!) are simply named with a URI. If you omit a datatype declaration, the value will be considered a plain literal by many tools, which is not the same thing as a string. However, as of RDF 1.1, this distinction is going away, so going forward you should be able to treat "Rob Gonzalez" and "Rob Gonzalez"^^xsd:string as equivalent, and many tools already do.
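To make the idea of a lexical-to-value mapping concrete, here is a small sketch with a deliberately tiny, hypothetical registry of datatypes. Python's `float` is a double-precision number, so it only approximates the value space of xsd:float.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical registry: datatype URI -> function mapping lexical form to value.
LEXICAL_TO_VALUE = {
    XSD + "string": str,
    XSD + "integer": int,
    XSD + "float": float,  # approximation: Python floats are doubles
}


def value_of(lexical, datatype):
    """Map a lexical representation to the value it denotes under a datatype."""
    return LEXICAL_TO_VALUE[datatype](lexical)


# Three different lexical representations, one value.
values = {value_of(text, XSD + "float") for text in ("5.1", "5.1000", "05.10")}
print(values)  # {5.1}
```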
The semantics of RDF take language tags and datatypes into account. This means two things. First, a literal without either a language tag or a datatype is different from a literal with a language tag, and both are different from a literal with a datatype. These four statements say four different things and none can be inferred from the others:
So, an untyped literal with or without a language tag is not the same as a typed literal. The second part of the semantics of literals is that two typed literals that appear different may be the same if their datatype maps their lexical representations to the same value. The following statements are equivalent (at least for an application that has been given the semantics of the XSD datatypes):
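Both points can be sketched in Python with a hypothetical `Literal` type: comparing terms looks at the lexical form, datatype, and language tag together, while comparing values first applies the datatype's mapping. Only xsd:float is handled in this sketch.

```python
from typing import NamedTuple, Optional

XSD_FLOAT = "http://www.w3.org/2001/XMLSchema#float"


class Literal(NamedTuple):
    lexical: str
    datatype: Optional[str] = None
    language: Optional[str] = None


def same_value(a, b):
    """Value equality; implemented here only for xsd:float typed literals."""
    if a.datatype == b.datatype == XSD_FLOAT:
        return float(a.lexical) == float(b.lexical)
    return a == b  # otherwise fall back to term equality


plain = Literal("20")
tagged = Literal("20", language="en")
typed = Literal("20", datatype=XSD_FLOAT)
typed_alt = Literal("20.0", datatype=XSD_FLOAT)

print(plain == tagged, plain == typed)  # False False
print(typed == typed_alt)               # False
print(same_value(typed, typed_alt))     # True
```

Here `typed` and `typed_alt` are different terms that denote the same value, while the plain and language-tagged literals stay distinct from both.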
These mean that Ahmad's age is 20. That is, the textual representation of the number is beside the point and is not part of the meaning encoded by the triples. Note that if the float datatype were not specified, the triples would not be inherently equivalent, and the textual representation of the 20 would be maintained as part of the information content.
Sometimes the value of a property needs to be a fragment of XML, or text that might contain XML markup. RDF/XML provides a special notation to make it easy to write literals of this kind, using the rdf:parseType attribute. Giving an element the attribute rdf:parseType="Literal" indicates that the contents of the element are to be interpreted as an XML fragment. In Turtle notation, the datatype used to describe XML literals is rdf:XMLLiteral.
Blank Nodes
Subjects or objects can be modeled as blank nodes. They denote the existence of an individual with specific attributes but without providing any information about its identity or reference. A ground graph is a graph that contains no blank nodes.
RDF Representation
Resources can in principle be anything that must be uniquely identified and referenced by a URI. Resources are described by representing their properties and the relationships among them; this representation can be done in several ways:
Labeled Directed Graph
This is a visual way of modeling RDF as a Node-Edge-Node triple. It is directed because the direction of the edge is significant and always points towards the object.
Directionality matters in RDF triples. Defining reverse relationships is possible but not strictly necessary. We will see later, in my post about the Web Ontology Language (OWL), that there is a simple way to define the "opposite" property using the inverseOf relationship.
XML RDF Notation
This notation uses XML syntax to represent triples. We use namespaces in order to minimize the writing that we have to do: if we use many URIs repeatedly throughout our representation, then it is better to define a namespace once and refer to it with a shorthand.
We notice that every triple is wrapped between the rdf:RDF tags. Namespaces are defined using xmlns (XML namespace) followed by a colon, the name of the namespace, and then the URI. Resources are represented with the <rdf:Description> tag, and the URI of the resource is given in the rdf:about attribute. In addition, custom properties are wrapped in tags named after the namespace defined at the top, followed by the name of the specific property, e.g. <aa:hasBlog>.
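A minimal sketch of how such a document maps back to triples, using only the standard library. This is not a full RDF/XML parser: it handles only rdf:Description elements with rdf:about, and property elements that carry either rdf:resource or plain text. The `aa` namespace URI and the `aa:hasOwner` property are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DESCRIPTION = "{%s}Description" % RDF_NS
ABOUT = "{%s}about" % RDF_NS
RESOURCE = "{%s}resource" % RDF_NS

# The aa namespace URI below is hypothetical.
RDF_XML = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:aa="http://example.org/aa#">
  <rdf:Description rdf:about="http://assaf.website">
    <aa:hasBlog rdf:resource="http://assaf.website/blog"/>
    <aa:hasOwner>Ahmad</aa:hasOwner>
  </rdf:Description>
</rdf:RDF>
"""


def triples_from_rdfxml(text):
    """Extract triples from the small RDF/XML subset described above."""
    triples = []
    for description in ET.fromstring(text.strip()).findall(DESCRIPTION):
        subject = description.get(ABOUT)
        for prop in description:
            # ElementTree tags look like "{namespace}local"; join them into a URI.
            predicate = prop.tag[1:].replace("}", "", 1)
            target = prop.get(RESOURCE)
            if target is not None:
                obj = "<%s>" % target                     # object is a resource
            else:
                obj = '"%s"' % (prop.text or "").strip()  # object is a literal
            triples.append((subject, predicate, obj))
    return triples


for triple in triples_from_rdfxml(RDF_XML):
    print(triple)
```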
We can also define more than one property for each resource:
In an RDF/XML document there are two types of XML nodes:
- Resource XML nodes: the subjects and objects of statements. They are usually rdf:Description tags that have an rdf:about attribute giving the URI of the resource they represent. In this example, the rdf:Description nodes are the resource nodes. Resource nodes contain within them property XML nodes (and nothing else).
- Property XML nodes: each property XML node represents a single statement. The subject of the statement is the outer resource XML node that contains the property.
We should also notice that we have used rdf:datatype to specify the use of certain types such as strings or integers. URIs can be shortened by using namespaces as well. We have the xml:base attribute, which defines a base URI that is used globally in the XML representation. So, for example, if I have declared an xml:base attribute as follows:
and I have changed my URI from http://assaf.website to http://assaf.website/Administrators#AhmadAssaf, then in the XML I point to it as:
Instead of using rdf:about and putting the hash # in front of the fragment, I can use rdf:ID, which has the hash built in, so that I can write:
However, although RDF/XML is difficult to read, verbose, and not very flexible, it is the standard format for web documents, as it can be embedded easily and XML is supported by most browsers and parsers.
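The relationship between xml:base, rdf:about, and rdf:ID can be sketched with the standard `urljoin` function. The base value used here is an assumption inferred from the URI http://assaf.website/Administrators#AhmadAssaf mentioned above.

```python
from urllib.parse import urljoin

# Assumed xml:base value, taken from the URI discussed above.
BASE = "http://assaf.website/Administrators"


def from_about(base, about):
    """rdf:about holds a URI reference, resolved against the base URI."""
    return urljoin(base, about)


def from_id(base, rdf_id):
    """rdf:ID holds a bare fragment name; the '#' is implied."""
    return urljoin(base, "#" + rdf_id)


print(from_about(BASE, "#AhmadAssaf"))
print(from_id(BASE, "AhmadAssaf"))
# both print http://assaf.website/Administrators#AhmadAssaf
```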
N3 Notation
N3 is a simple listing of triples. It is a shorthand, non-XML serialization of RDF models designed with human readability in mind. It is more compact than XML, but for complex and large models it can become unwieldy.
N3 has some syntactic sugar that allows further abbreviations. If many statements repeat the same subject and predicate, just separate the objects with commas:
And if the same subject is repeated, but with different predicates, one may use semicolons, as in the example:
Turtle (Terse RDF Triple Language)
Turtle is a simplified subset of the N3 notation. URIs are wrapped in angle brackets and literals in quotation marks. Every triple ends with a period, and whitespace and indentation are ignored.
We have talked about namespaces and base URIs in XML. The same concepts are transferred into the Turtle notation:
so in my example I will have:
Note that we have used a semicolon to express that we want to keep stating facts about the same subject, which is <http://assaf.website>. We will go into more detail about this notation a bit later. In Turtle, referring to such URIs is done in a way similar to using rdf:ID: the hash is replaced by a leading colon followed by the fragment, e.g. :AhmadAssaf.
The current base URI may be altered in a Turtle document using the @base directive. It allows further abbreviation of URIs but is usually used for simplifying the URIs in the data, whereas the prefix directives are for the vocabularies that describe the data. For example:
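What a Turtle reader does with prefixes and the base URI can be sketched as a small expansion function. The prefix table and base below are hypothetical values chosen for illustration; only the expansion logic matters.

```python
from urllib.parse import urljoin

BASE = "http://assaf.website/"  # plays the role of @base (assumed value)
PREFIXES = {
    "": "http://assaf.website/Administrators#",  # default prefix (assumed)
    "aa": "http://example.org/aa#",              # hypothetical vocabulary
}


def expand(term, prefixes=PREFIXES, base=BASE):
    """Turn a Turtle URI term into a full URI."""
    if term.startswith("<") and term.endswith(">"):
        # <...> terms are resolved against the base URI.
        return urljoin(base, term[1:-1])
    # prefix:local terms are expanded through the prefix table.
    prefix, _, local = term.partition(":")
    return prefixes[prefix] + local


print(expand(":AhmadAssaf"))
print(expand("aa:hasBlog"))
print(expand("<blog>"))
```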
Notes
- Turtle strings and URIs can use backslash escape sequences to represent special characters and Unicode code points (\t, \n, etc.).
- Comments in Turtle take the form of #, outside a relative URI or string, and continue to the end of the line.
- The , symbol may be used to repeat the subject and predicate of triples that differ only in the object term.
- The ; symbol may be used to repeat the subject of triples that vary only in predicate and object terms.
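The effect of the , and ; abbreviations can be sketched as a tiny serializer that groups triples by subject and then by predicate. Terms are assumed to be already written in Turtle form, and the `aa:` property names are hypothetical.

```python
def to_turtle(triples):
    """Group triples by subject and predicate, abbreviating with ';' and ','."""
    grouped = {}
    for s, p, o in triples:
        grouped.setdefault(s, {}).setdefault(p, []).append(o)
    blocks = []
    for s, predicates in grouped.items():
        # ',' repeats subject and predicate; ';' repeats only the subject.
        lines = [f"{p} {' , '.join(objects)}" for p, objects in predicates.items()]
        blocks.append(f"{s} " + " ;\n    ".join(lines) + " .")
    return "\n".join(blocks)


triples = [
    (":blog", "aa:hasTag", '"rdf"'),
    (":blog", "aa:hasTag", '"semantic web"'),
    (":blog", "aa:hasOwner", ":AhmadAssaf"),
]
print(to_turtle(triples))
# :blog aa:hasTag "rdf" , "semantic web" ;
#     aa:hasOwner :AhmadAssaf .
```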
The Framework is a combination of web-based protocols (URI, HTTP, XML, etc.) that are based on formal models and that define all the allowed relationships among resources.
Blank Nodes (B Nodes )
We have talked earlier about blank nodes, or B nodes. They are nodes that might not have names and are potentially un-referenceable. RDF only allows binary relations, so it is necessary to express many-way relations using intermediate nodes, and these nodes are often anonymous. They are usually used to express collections. For example, if I want to say that my blog is influenced by other resources, in natural language I can say that my blog at http://assaf.website/blog is influenced by Blog A at http://www.BlogA.com and Blog B at http://www.BlogB.com. If I want to represent this as a directed graph, it will be difficult to tell which URL corresponds to which resource name. So we have to think of a workaround to ensure unique representation and selection of resources. We can represent this multi-valued relation by introducing additional nodes in the graph:
The nodes in black are blank nodes that do not have a name and only act as a connection point.
Now, how can we represent blank nodes in RDF/XML and Turtle?
In RDF/XML, blank nodes are denoted by rdf:parseType="Resource"; in Turtle it is much easier:
we simply introduce blank nodes by using square brackets [ ]. So far we have said that blank nodes do not have names. This is not always the case, as sometimes I need to give a label to a blank node in order to reference it locally. In RDF/XML this is done with the rdf:nodeID attribute, whose value is only meaningful within the document.
In Turtle we use an underscore followed by a colon (_:) and then the blank node name.
So blank nodes do NOT need to be dereferenceable or accessible from the outside world; however, they can have IDs or names that allow them to be referenced within an RDF document or model.
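As a sketch of the blog example, the following Python generates one labeled blank node per influencing blog and attaches the name and URL to it. The property names (`aa:influencedBy`, `aa:name`, `aa:url`) and the `_:bN` labeling scheme are assumptions made for illustration.

```python
from itertools import count

_labels = count(1)


def new_blank_node():
    """Return a fresh, document-local blank node label such as _:b1."""
    return f"_:b{next(_labels)}"


def influenced_by(blog, sources):
    """Connect a blog to (name, url) pairs through intermediate blank nodes."""
    triples = []
    for name, url in sources:
        node = new_blank_node()
        triples.append((blog, "aa:influencedBy", node))
        triples.append((node, "aa:name", f'"{name}"'))
        triples.append((node, "aa:url", f"<{url}>"))
    return triples


def is_ground(triples):
    """A ground graph contains no blank nodes."""
    return not any(term.startswith("_:") for s, _, o in triples for term in (s, o))


triples = influenced_by(
    "<http://assaf.website/blog>",
    [("Blog A", "http://www.BlogA.com"), ("Blog B", "http://www.BlogB.com")],
)
for triple in triples:
    print(triple)
print(is_ground(triples))  # False
```

Each name now stays paired with its URL because both hang off the same intermediate node.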
Data Structures in RDF
In RDF there exist collection-like structures (in computer science terms) that allow us to aggregate nodes or facts together. They are general data structures for enumerating any resources or literals. They are basically syntactic sugar that eases the process of writing RDF, with no additional semantic expressiveness. RDF has two different aggregators:
Containers
A container is an open list of elements, possibly including duplicate members, to which new entries can be added. Note that the container resource (which may be either a blank node or a resource with a URIref) denotes the group as a whole. The members of the container can be described by defining a container membership property for each member, with the container resource as its subject and the member as its object. These container membership properties have names of the form rdf:_n, where n is a decimal integer greater than zero with no leading zeros, e.g., rdf:_1, rdf:_2, rdf:_3, and so on, and are used specifically for describing the members of containers. Container resources may also have other properties that describe the container, in addition to the container membership properties and the rdf:type property.
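A minimal sketch of how a container expands into triples: one rdf:type triple plus one rdf:_n membership triple per member. The `kind` argument covers the three container types RDF defines (rdf:Seq, rdf:Bag, rdf:Alt); the blank node label and member names are hypothetical.

```python
def container_triples(container, kind, members):
    """Describe a container with rdf:type and numbered membership properties."""
    if kind not in ("Seq", "Bag", "Alt"):
        raise ValueError(f"unknown container type: {kind}")
    triples = [(container, "rdf:type", f"rdf:{kind}")]
    # Membership properties are numbered from 1, with no leading zeros.
    for n, member in enumerate(members, start=1):
        triples.append((container, f"rdf:_{n}", member))
    return triples


for triple in container_triples("_:c1", "Seq", [":BlogA", ":BlogB", ":BlogA"]):
    print(triple)
```

Duplicates are allowed, and adding a member later only means adding one more rdf:_n triple, which is why containers are described as open.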
We notice that the container node has an rdf:type of rdf:Seq, which means that this is a sequential container, or list, in which the order of the items is important. Each item in the list has a unique property: rdf:_1, rdf:_2, etc. In Turtle, sequential containers are represented as:
Or it can be written as:
or written as:
In RDF/XML:
RDF/XML provides rdf:li as a convenience element to avoid having to explicitly number each membership property. The numbered properties rdf:_1, rdf:_2, and so on are generated from the rdf:li elements in forming the corresponding graph. The element name rdf:li was chosen to be mnemonic with the term "list item" from HTML.
In the Turtle form, we notice that we referenced our blog through the default namespace using the colon : and referred to the container with brackets [] (like blank nodes), adding rdf:type rdf:Seq to that blank node. In Turtle, a shorthand for rdf:type is the letter a (as in natural English, when you say that this is a cat, you mean that it is something with the type cat). So far we have seen the sequential container, which denotes an ordered list of elements, but there are also other types of containers:
- rdf:Bag: An unordered list of elements, possibly including duplicate members.
- rdf:Alt: Defines alternatives: a group of resources or literals that are alternatives (typically for a single value of a property), of which only one is relevant to the application. An Alt container is intended to have at least one member, identified by the property rdf:_1. This member is intended to be considered the default or preferred value. Other than the member identified as rdf:_1, the order of the remaining elements is not significant.
Collections
These are closed lists where no extension is possible; the elements of the list are predefined. A collection resembles the traditional list data structure, except that it is closed to insert operations. It is split recursively into a head (rdf:first) and a tail (rdf:rest), and we end the list by linking to rdf:nil. Each of the blank nodes forming this list structure is implicitly of type rdf:List; that is, each of these nodes implicitly has an rdf:type property whose value is the predefined type rdf:List, although this is not explicitly shown in the graph.
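The head/tail structure can be sketched as follows: every member gets its own list node with an rdf:first and an rdf:rest triple, and the last rdf:rest points to rdf:nil. The `_:lN` blank node labels and member names are hypothetical.

```python
def list_triples(members, prefix="_:l"):
    """Return (head, triples) for a closed list built from rdf:first / rdf:rest."""
    if not members:
        return "rdf:nil", []  # the empty list is rdf:nil itself
    nodes = [f"{prefix}{i}" for i in range(1, len(members) + 1)]
    triples = []
    for i, (node, member) in enumerate(zip(nodes, members)):
        rest = nodes[i + 1] if i + 1 < len(nodes) else "rdf:nil"
        triples.append((node, "rdf:first", member))  # head of this cell
        triples.append((node, "rdf:rest", rest))     # tail: next cell or rdf:nil
    return nodes[0], triples


head, triples = list_triples([":BlogA", ":BlogB"])
print(head)
for triple in triples:
    print(triple)
```

Because the final cell points to rdf:nil, nothing can be appended without rewriting an existing triple, which is what makes the list closed.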
However, in Turtle there is a shortcut that simplifies the syntax: we use a new type of brackets, the parentheses ( ).
The collection of statements (triples) together forms an RDF graph. Asserting an RDF graph (asserting means claiming that something is true) amounts to asserting all the triples in it, that is, a conjunction (logical AND) of the statements corresponding to all the triples in the graph.
The datastore that stores these triples is often called a triple store.
RDF/XML provides a special notation to make it easy to describe collections using graphs of this form. In RDF/XML, a collection can be described by a property element that has the attribute rdf:parseType="Collection" and that contains a group of nested elements representing the members of the collection. RDF/XML provides the rdf:parseType attribute to indicate that the contents of an element are to be interpreted in a special way. In this case, the rdf:parseType="Collection" attribute indicates that the enclosed elements are to be used to create the corresponding list structure in the graph.
Semantic Web practitioners found it very difficult to deal with large numbers of triples in application development. There are lots of reasons to segment different subsets of triples from each other (simplified access control, simplified updating, trust, etc.), and vanilla RDF made segmentation tedious.
Named Graphs
At first the community tried using reification to solve this data segmentation problem, but today everyone has converged on using named graphs. A named graph is a collection of statements that is given an identifier (URI). When referring to a triple in a named graph, we often use a 4-tuple notation (often referred to as a quad) instead of the standard 3-tuple one:
When using named graphs, TriG is the de facto serialization. It is the same as Turtle except that the statements in a single graph are grouped with {}.
This trivial example puts all the statements in the document into a single named graph, egotistically called :blogGraph. Again, like all things in RDF, :blogGraph is a URI. Looking at the 4-tuples, it is pretty obvious that the same statement can exist in multiple named graphs. This is by design and is a very important feature. By organizing statements into named graphs, a Semantic Web application can implement access control, trust, data lineage, and other functionality very cleanly.
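A minimal sketch of quads in Python, where the fourth element names the graph. The `:trustedGraph` name and the `aa:` properties are hypothetical; `:blogGraph` is the graph name used above.

```python
# Each quad is (subject, predicate, object, graph).
quads = {
    (":blog", "aa:hasOwner", ":AhmadAssaf", ":blogGraph"),
    (":blog", "aa:hasOwner", ":AhmadAssaf", ":trustedGraph"),  # same triple, second graph
    (":blog", "aa:hasTag", '"rdf"', ":blogGraph"),
}


def triples_in(quads, graph):
    """All triples that belong to one named graph."""
    return {(s, p, o) for s, p, o, g in quads if g == graph}


def graphs_containing(quads, triple):
    """All named graphs in which a given triple appears."""
    return {g for s, p, o, g in quads if (s, p, o) == triple}


print(triples_in(quads, ":trustedGraph"))
print(graphs_containing(quads, (":blog", "aa:hasOwner", ":AhmadAssaf")))
```

Selecting by graph name is what lets an application apply access control or trust decisions to a whole group of statements at once.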
RDF Reification
RDF applications sometimes need to describe other RDF statements using RDF, for instance to record information about when statements were made, who made them, or other similar information (sometimes referred to as "provenance" information). Moreover, we sometimes come across use cases where we deduce facts in our model, and the deduced facts also need to be used and modeled (that is, become the subject of new statements). For example, if a detective (Sherlock Holmes) deduced that the gardener is the one who killed the butler, then I might want to use this newly discovered fact in a new statement. This is another use case for reification.
Reification allows interleaving of statements and making statements about other statements.
RDF provides a built-in vocabulary intended for describing RDF statements. A description of a statement using this vocabulary is called a reification of the statement. The reification vocabulary consists of:
- rdf:Statement: describes an RDF statement, with the following properties:
  - rdf:subject: the described resource
  - rdf:predicate: the original property
  - rdf:object: the value of the property
The use of reification makes it fairly simple to represent this information:
However, while RDF provides this reification vocabulary, care is needed in using it, because it is easy to imagine that the vocabulary defines some things that are not actually defined. For example, let's take the statement:
Using the reification vocabulary, a reification of the statement about the blog's administrator would be given by assigning the statement a URIref such as :triple12345 (so that statements can be written describing it), and then describing the statement using the statements:
The subject of these reification triples is a URIref formed by concatenating the base URI of the document (given in the xml:base declaration), the character # (to indicate that what follows is a fragment identifier), and the value of the rdf:ID attribute. You can also generate the same reification graph by using rdf:ID on the property element. In this case, specifying the attribute rdf:ID="triple12345" in the aa:hasAdmins element results in the original triple describing the blog administrator:
plus the reification triples:
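A minimal sketch of how the four reification triples are derived from an original triple. The subject and object names (`:blog`, `:AhmadAssaf`) and the `aa:assertedBy` provenance property are hypothetical; `aa:hasAdmins` and `:triple12345` come from the discussion above.

```python
def reify(statement_id, triple):
    """Describe a triple using the RDF reification vocabulary."""
    s, p, o = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
    ]


original = (":blog", "aa:hasAdmins", ":AhmadAssaf")
reified = reify(":triple12345", original)

# Once the statement has an identifier, other statements can be made about it.
provenance = (":triple12345", "aa:assertedBy", ":SherlockHolmes")

for triple in [original] + reified + [provenance]:
    print(triple)
```

Note that the reification triples do not assert the original triple by themselves, which is why the original is listed alongside them.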
Reification Advantages
- Modeling data provenance
- Formalizing statements about Reliability and Trust
- Definition of metadata about statements (Assertions and Statements)
Wrap Up
- An RDF model is a set of RDF statements
- An RDF statement consists of a subject, a property, and an object
- Subjects and properties are resources, while an object can be either a literal or a resource
So far we have talked about several representations (a serialization format is a way to encode information so that it can be parsed when passed between machines; XML is a serialization format, whereas RDF is a data model); however, there are a few more worth mentioning:
- TriG: TriG is Turtle with support for named graphs. It is the de facto standard for serializing RDF with named graphs.
- RDFa (RDF embedded in HTML): You can embed RDF data within normal web pages by using RDFa.
- N-Triples: A very basic RDF serialization. Its key feature is that only one triple exists per line, so it is very quick to parse and Unix command-line tools can easily manipulate it. It is also highly compressible, so large public data sources often publish data in N-Triples form.
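The one-triple-per-line property of N-Triples can be sketched with a small writer. It assumes subjects and predicates are full URIs and objects are tagged as either `"uri"` or `"literal"`; the vocabulary URIs are hypothetical, and only the most common string escapes are handled.

```python
def escape(text):
    """Escape characters that would break the one-line-per-triple layout."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples(triples):
    """Serialize triples as N-Triples: full URIs, one statement per line."""
    lines = []
    for s, p, (kind, value) in triples:
        obj = f"<{value}>" if kind == "uri" else f'"{escape(value)}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines) + "\n"


data = [
    ("http://assaf.website", "http://example.org/aa#hasBlog", ("uri", "http://assaf.website/blog")),
    ("http://assaf.website", "http://example.org/aa#motto", ("literal", "line one\nline two")),
]
text = to_ntriples(data)
print(text, end="")
print(len(text.splitlines()))  # 2
```

Because every statement sits on its own line, counting or filtering triples reduces to counting or filtering lines.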
What sets RDF apart from XML is that RDF is designed to represent knowledge in a distributed world. That RDF is designed for knowledge, and not data, means RDF is particularly concerned with meaning. Everything mentioned in RDF means something. It may be a reference to something in the world, like a person or movie, or it may be an abstract concept, like the state of being friends with someone else. And by putting three such entities together, the RDF standard says how to arrive at a fact. The second key aspect of RDF is that it works well for distributed information. That is, RDF applications can put together RDF files posted by different people around the Internet and easily learn from them new things that no single document asserted. It does this in two ways: first by linking documents together through the common vocabularies they use, and second by allowing any document to use any vocabulary. This allows enormous flexibility in expressing facts about a wide range of things, drawing on information from a wide range of sources.
To distinguish between URIs, namespaced names (abbreviated URIs), anonymous nodes, and literal values, I used the following common convention:
- Full URIs are enclosed in angle brackets.
- Namespaced names are written plainly, but their colons give them away.
- Anonymous nodes are written like namespaced names, but in the reserved "_" namespace with an arbitrary local name after the colon.
- Literal values are enclosed in quotation marks.
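The convention above is regular enough to check mechanically. Here is a small sketch of a classifier for terms written this way; the function name and the returned labels are illustrative choices.

```python
def classify(term):
    """Classify a term according to the notation convention listed above."""
    if term.startswith("<") and term.endswith(">"):
        return "full URI"
    if term.startswith('"'):
        return "literal"
    if term.startswith("_:"):
        return "anonymous node"  # the reserved "_" namespace
    if ":" in term:
        return "namespaced name"
    raise ValueError(f"unrecognized term: {term!r}")


for term in ("<http://assaf.website/blog>", "rdf:type", "_:b1", '"Semantic"@en'):
    print(term, "->", classify(term))
```

The anonymous-node check has to come before the namespaced-name check, since both forms contain a colon.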