Introduction to Semantic Web
A gentle introduction to the new paradigm in the Web
14 min read
This post is part of
A Semantic Web series
- Introduction to Semantic Web
- Everything You Need to Know About RDF
The model behind the Web could be roughly summarized as a way to publish documents represented in a standard way (HTML), containing links to other documents and accessible through the Internet using standard protocols (HTTP and TCP/IP). The result is a worldwide, distributed file system of interconnected documents that humans can read, exchange and discuss.
tl;dr
- Before the Web, people wrote documents and cited references; to follow a reference, you had to go and look it up in a library.
- The great invention of the Web is the hyperlink: click on a link and you arrive at the next document in the chain. Following a reference became effortless. Web 1.0 was the web of documents.
- Web 2.0 brought application silos: social applications that hold your data but do not interoperate (updating your Facebook profile does not update your LinkedIn profile). The data are not linked, and this problem exists not only on the Web but also inside enterprise data.
- The Semantic Web is all about connecting the data: not the documents, but the data at a lower level.
In summary, the great advantage of the Web was that it abstracted away the physical storage and networking layers involved in information exchange between two machines. This breakthrough enabled documents to appear to be directly connected to one another. Click a link, and you're there—even if that link goes to a different document on a different machine on another network on another continent! So, in the same way that Web 1.0 abstracts away the network and physical layers, the Semantic Web abstracts away the document and application layers involved in the exchange of information.
The Semantic Web connects facts so that rather than linking to a specific document or application, you can instead refer to a specific piece of information contained in that document or application. If that information is ever updated, you can automatically take advantage of the update.
The word semantic itself implies meaning or understanding. As such, the fundamental difference between Semantic Web technologies and other technologies related to data (such as relational databases or the World Wide Web itself) is that the Semantic Web is concerned with the meaning and not the structure of data. This fundamental difference engenders an entirely different outlook on how storing, querying, and displaying information might be approached. Some applications, such as those that refer to a large amount of data from many different sources, benefit enormously from this feature. Others, such as storing high volumes of highly structured transactional data, do not.
What "semantic" means in the Semantic Web is not that computers will understand the meaning of anything but that the logical pieces of meaning can be mechanically manipulated by a machine to valuable ends.
So now imagine a new Web where the actual content can be manipulated by computers. For now, picture it as a web of databases. One "semantic" website publishes a database about a product line, with products and descriptions, while another publishes a database of product reviews. A third site for a retailer publishes a database of products in stock. What standards would make writing an application to mesh distributed databases together easier so that a computer could use the three data sources to help an end user make better purchasing decisions?
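To make that concrete, here is a minimal sketch of such a mesh in Python using the rdflib library (my choice of toolkit; any RDF library would do). The product URI, the ex: vocabulary, and all values are invented for illustration. Because both sources name the product with the same URI, loading them into one graph merges the facts, and a single query spans both sources.

```python
from rdflib import Graph

# Hypothetical RDF published by two independent sites: a product
# catalogue and a review site. All URIs and properties are invented.
catalogue = """
@prefix ex: <http://example.org/vocab/> .
<http://catalogue.example/p42> ex:name  "Espresso Machine" ;
                               ex:price 199.00 .
"""

reviews = """
@prefix ex: <http://example.org/vocab/> .
<http://catalogue.example/p42> ex:rating 4.5 .
"""

# Because both sources use the same URI for the same product,
# loading them into one graph merges the facts automatically.
g = Graph()
g.parse(data=catalogue, format="turtle")
g.parse(data=reviews, format="turtle")

for row in g.query("""
    PREFIX ex: <http://example.org/vocab/>
    SELECT ?name ?price ?rating WHERE {
      ?product ex:name ?name ; ex:price ?price ; ex:rating ?rating .
    }"""):
    print(row.name, row.price, row.rating)
```

No schema negotiation or scraping is needed: the shared identifier is what does the joining.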
The Semantic Web does not deal with unstructured content; instead, it represents not only structured data and links but also the meaning of the underlying concepts and relationships. Nothing stops anyone from writing a program now to do those sorts of things, just like nothing stopped anyone from exchanging data before we had XML. But standards facilitate building applications, especially in a decentralized system.
The Semantic Web addresses these discoverability challenges through the adoption of distinct identifiers for concepts and their relationships. These identifiers, referred to as Uniform Resource Identifiers (URIs), resemble web page URLs but are not restricted to identifying web documents. Instead, their primary purpose is to provide unique identification for objects and concepts, as well as their interconnections.
Utilizing URIs significantly reduces the ambiguity in information. However, the Semantic Web takes it a step further by enabling concepts to be linked with hierarchical classifications. This enables the inference of new information based on an individual concept's classification and its relationships with other concepts. Achieving this involves the utilization of ontologies, which are structured hierarchies of concepts used to categorize individual concepts.
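As a rough illustration of that inference, the following sketch (again Python with rdflib; the tiny ontology and all names are invented) classifies one individual as a Dog and then retrieves it as an Animal, using a SPARQL property path to walk the class hierarchy.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
@prefix ex:   <http://example.org/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# An invented toy ontology: every Dog is a Mammal, every Mammal an Animal.
ex:Dog    rdfs:subClassOf ex:Mammal .
ex:Mammal rdfs:subClassOf ex:Animal .

# One individual, classified only as a Dog.
ex:rex rdf:type ex:Dog .
""", format="turtle")

# The property path rdf:type/rdfs:subClassOf* walks up the class
# hierarchy, so Rex is found as an Animal although that fact is
# never stated explicitly.
results = g.query("""
    PREFIX ex:   <http://example.org/>
    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?animal WHERE { ?animal rdf:type/rdfs:subClassOf* ex:Animal . }
""")
for row in results:
    print(row.animal)  # -> http://example.org/rex
```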
From a technical point of view, the Semantic Web consists of:
- Data Model: Resource Description Framework (RDF): The data modeling language for the Semantic Web. All Semantic Web information is stored and represented in RDF. It is a flexible, abstract model, meaning that the same RDF data can be written down in more than one concrete serialization (a short sketch of this follows the list).
- Query Language: SPARQL (SPARQL Protocol and RDF Query Language): The query language of the Semantic Web. It is specifically designed to query data across various systems.
- Schema and Ontology Languages: RDF Schema (RDFS) and the Web Ontology Language (OWL): The schema, or knowledge representation (KR), languages of the Semantic Web. They enable you to define concepts in a composable way so that these concepts can be reused as much and as often as possible. Composability means that each concept is carefully defined so it can be selected and assembled in various combinations with other concepts as needed for many different applications and purposes.
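To see what "more than one representation" means in practice, here is a small sketch (Python with rdflib, invented data) that parses one tiny graph and writes the same triples back out in several RDF serializations.

```python
from rdflib import Graph

# One abstract graph, several concrete serializations.
g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob ; ex:name "Alice" .
""", format="turtle")

# "json-ld" ships with rdflib 6+; older versions need a plugin.
for fmt in ("turtle", "xml", "nt", "json-ld"):
    print(f"--- {fmt} ---")
    print(g.serialize(format=fmt))
```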
Semantic technologies represent a fairly diverse family of technologies that have existed for a long time and seek to help derive meaning from information. Some examples of semantic technologies include natural language processing (NLP), data mining, artificial intelligence (AI), category tagging, and semantic search. The goal of semantic technologies is to separate signal from noise. Some examples of existing semantic technologies being used today include:
Natural-language processing (NLP): NLP technologies attempt to process unstructured text content and extract the names, dates, organizations, events, etc., discussed within the text. There are many extensions of NLP, and they include:
- Search: Semantic Search often requires NLP parsing of source documents. The specific technique used is Entity Extraction, which identifies proper nouns (e.g., people, places, companies) and other specific information relevant to a search. For example, consider the query, "Find me all documents that mention Barack Obama." Some documents might contain "Barack Obama," others "President Obama," and still others "Senator Obama." Extractors will map all these terms to a single concept when used correctly (a toy sketch of this mapping follows this list).
- Auto-categorization: Imagine you have 100,000 news articles and want to sort them based on specific criteria. That would take humans ages, but a computer can do it quickly.
- Sentiment Analysis: Sentiment Analysis measures the "sentiment" of an article, typically meaning whether the article's tone is positive, negative, or neutral. This application of NLP technology is often used in conjunction with search, but it can also be used in other contexts, such as alerting. For example, a business owner might ask an application to "alert me when someone says something negative regarding my company on Facebook."
- Summarization: Often used in conjunction with research applications, summaries of topics are created automatically so that people do not have to wade through many long-winded articles (perhaps such as this one!).
- Question Answering: This is the new hot topic in NLP, as evidenced by Siri and Watson. However, long before these tools, we had Ask Jeeves (now Ask.com) and later Wolfram Alpha, which specialized in question-answering. The idea here is to ask a computer a question and have it answer you (Star Trek-style! "Computer…").
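To make the entity-extraction idea above concrete, here is a deliberately naive Python sketch: real extractors rely on trained statistical models, but a hand-written alias dictionary (with invented URIs) shows how different surface forms can resolve to one concept.

```python
# Toy entity linking: map surface forms to one canonical concept URI.
# Real NLP extractors use trained models; this dictionary merely
# stands in for one. The URIs are invented.
ALIASES = {
    "barack obama":    "http://example.org/entity/Barack_Obama",
    "president obama": "http://example.org/entity/Barack_Obama",
    "senator obama":   "http://example.org/entity/Barack_Obama",
}

def link_entities(text: str) -> set[str]:
    """Return the canonical URIs of all known entities mentioned in text."""
    lowered = text.lower()
    return {uri for alias, uri in ALIASES.items() if alias in lowered}

for doc in ("Senator Obama spoke on Tuesday.",
            "A speech by President Obama."):
    print(link_entities(doc))
# Both documents resolve to the same concept URI, so a search
# for that concept finds both.
```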
Data mining: Data mining technologies employ pattern-matching algorithms to tease out trends and correlations within large data sets. Data mining can be used, for example, to identify suspicious and potentially fraudulent trading behavior in large databases of financial transactions.
Artificial intelligence or expert systems: AI or expert systems technologies use elaborate reasoning models to answer complex questions automatically. These systems often include machine-learning algorithms that can improve the system's decision-making capabilities over time.
Classification: Classification technologies use heuristics and rules to tag data with categories to help search and analyze information.
Semantic search: Semantic search technologies allow people to locate information by concept instead of keyword or keyphrase. With semantic search, people can easily distinguish between searching for John F. Kennedy, the airport, and John F. Kennedy, the president.
The main goal behind knowing these technologies is that they help us assemble the Semantic Web's building blocks. For example, NLP can extract structured data from unstructured documents (flat files like text documents). This data is then linked via Semantic Web technologies to other published data. This bridges the gap between documents (unstructured data) and structured data.
What does that all mean in practice?
The Semantic Web is a vision about an extension of the existing World Wide Web, which provides software programs with machine-interpretable metadata of the published information and data. In other words, we add further data descriptors to otherwise existing content and data on the Web. As a result, computers are able to make meaningful interpretations similar to the way humans process information to achieve their goals.
What’s behind the original vision of the Semantic Web comes under the umbrella of three things: automation of information retrieval, the Internet of Things, and personal assistants. You can read more about all three in the seminal article by Tim Berners-Lee, James Hendler and Ora Lassila, published in Scientific American: The Semantic Web (2001).
The ultimate ambition of the Semantic Web, as its founder Tim Berners-Lee sees it, is to enable computers to better manipulate information on our behalf. He further explains that, in the context of the Semantic Web, the word “semantic” indicates machine-processable, or what a machine is able to do with the data, whereas “web” conveys the idea of a navigable space of interconnected objects with mappings from URIs to resources.
Linked Open Data (LOD)
One of the most important movements in the Semantic Web community is Linked Open Data, which strives to expose and connect all of the world's data in a readily queryable and consumable form. Linked Data aims to publish structured data so that it can be easily consumed and combined with other Linked Data.
The Four Rules of Linked Data
In a way, Linked Data is the Semantic Web realized via four best practice principles.
- Use URIs as names for things. An example of a URI is any URL; for example, http://assaf.website is the URI that refers to Ahmad Assaf.
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide useful information using standards such as RDF and SPARQL.
- Include links to other URIs so that people can discover more things.
The Four Rules Applied
- Instead of using application-specific identifiers—database keys, UUIDs, incremental numbers, etc.—you map them to a set of URIs. Each identifier must map to one single URI. For example, each row of a database table becomes uniquely identifiable through its URI.
- Make your URIs dereferenceable. This roughly means making them accessible via HTTP, as we do for every human-readable Web page. This is a crucial aspect of Linked Data: every row of your tables is now fetchable and uniquely identifiable anywhere on the Web.
- Have your web server reply with some structured data when invoked. This is the "juicy" Semantic Web part. Model your data with RDF; this is where you must shift from a relational data model to a graph one.
- Once all the rows of your tables have been uniquely identified, made dereferenceable through HTTP, and described with RDF, the last step is to provide links between rows across different tables. The aim is to make explicit the links that were only implicit before the shift to the Linked Data approach. (A minimal sketch of these four steps follows.)
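Here is that sketch, in Python with rdflib and entirely invented table rows and URIs: each row of a hypothetical `people` table gets its own URI, is described in RDF using the FOAF vocabulary, and its old foreign-key reference becomes an explicit link into another dataset.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import FOAF, RDF

# Hypothetical rows exported from a relational `people` table:
# (primary key, name, foreign key now pointing at another dataset).
people = [
    (1, "Ada Lovelace", "http://other.example/dept/7"),
]

EX = Namespace("http://data.example.org/person/")  # invented base URI

g = Graph()
g.bind("foaf", FOAF)
for pk, name, dept in people:
    person = EX[str(pk)]                       # rule 1: one URI per row
    # rules 2-3: the URI is served over HTTP with an RDF description
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal(name)))
    # rule 4: the implicit foreign-key link becomes an explicit RDF link
    g.add((person, EX.memberOf, URIRef(dept)))

print(g.serialize(format="turtle"))
```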
Because of their linking, these datasets form a giant web of data or a knowledge graph, which connects a vast amount of descriptions of entities and concepts of general importance. This interconnectedness allows for the discovery of new relationships and insights that would be difficult to uncover otherwise.
Semantic Metadata
Searching websites using keywords has been a long-standing method for finding information online. However, we're just beginning to fully leverage the vast amount of knowledge available on the global web. New standards for Structured Data help inform search engines about the actual entities a webpage contains, beyond just keywords, transitioning from a web of keywords to a more powerful web of things.
From a Web of Keywords to a Web of Things
However, relying solely on words to determine the content of a page can be limiting. It restricts the potential use of the information on your page. Structured data on web pages embodies Tim Berners-Lee's vision of the semantic web, allowing machines, not just humans, to understand web knowledge.
Don't confuse 'web of things' with 'internet of things'. The 'internet of things' typically refers to connecting electronic devices over the web, not to embedding structured data in web pages.
Consider a typical web page, such as the one you're reading. A search engine typically analyzes the text to determine the content, identifying phrases like structured data and knowledge graph. While useful, these words alone don't tell the search engine about the real entities the page discusses. Examples of entities might include: People (e.g., names, photos, contact details), Places (e.g., businesses, cities, parks), Events (e.g., event listings with time and location), Articles on various topics, etc.
Search engines cannot automatically extract these entities and their associated facts from text alone. Structured data provides the necessary tools and standards to help search engines identify these entities, creating an index not just of keywords, but of the entities and facts they contain. This approach is powerful and opens new possibilities for how web information can be used.
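As an illustration, here is a sketch (Python with rdflib 6+, whose built-in JSON-LD parser stands in for a crawler) of the kind of schema.org markup a page might embed and the entity facts that can be extracted from it; the movie data is invented.

```python
from rdflib import Graph

# The kind of schema.org JSON-LD a page might embed in a <script> tag.
# Real pages usually reference the https://schema.org remote context;
# an inline @vocab is used here so the sketch runs offline.
jsonld = """
{
  "@context": { "@vocab": "http://schema.org/" },
  "@id": "http://example.org/movie/mary-poppins",
  "@type": "Movie",
  "name": "Mary Poppins",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "8.2"
  }
}
"""

g = Graph()
g.parse(data=jsonld, format="json-ld")

# A crawler now sees entities and typed facts, not just keywords.
for s, p, o in g:
    print(s, p, o)
```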
Examples of Using Structured Data
Rich Snippets
Structured data enables rich snippets, where search engines display key facts about a search query directly in the results. For example, searching for "Mary Poppins" on Google shows not just a list of web pages, but also key information like the movie's IMDb rating, extracted because it was embedded as structured data.
Global Knowledge Graphs
Search engines like Google and Bing use structured data to build knowledge graphs. These graphs provide a rich set of facts about entities, such as information about William Shakespeare when you search for him. This data is partly sourced from structured data on web pages.
Structured Data Standards
To publish information in a way that search engines can understand the entities a page contains, we use structured data standards. While we won't dive into the specifics here, some key terms you should be familiar with include:
- Schema.org: A standard vocabulary of entities defined by major search engines like Google.
- Microdata: Embedding structured data within HTML markup.
- JSON-LD: A syntax for embedding structured data, typically in the page header.
- RDFa: A version of RDF designed for embedding in HTML.
- Knowledge Graph: A large structured index of entities on the web, such as the Google Knowledge Graph.
Knowledge Graphs: The Next Frontier of the Semantic Web
Knowledge graphs emerged later but quickly became a key factor in the adoption of Semantic Web standards and the technologies that support them. They brought the Semantic Web concept into enterprises by using semantic metadata to enhance data and content management, breaking down silos and facilitating integration with various knowledge management practices.
Enterprise knowledge graphs employ ontologies to explicitly define the different conceptual models (such as schemas, taxonomies, and vocabularies) used within various systems. In enterprise data management terms, knowledge graphs are considered a sophisticated form of semantic reference data, comprising interconnected descriptions of entities like objects, events, or concepts.
Thus, knowledge graphs enable organizations to enhance their proprietary information by providing a global context for interpretation and serving as a resource for data enrichment.
You can read more information about Knowledge Graphs in my post here.