Introduction to Multidimensional Databases
Introducing the terms and concepts around multidimensional databases and data warehouses
8 min read
The multidimensional data model was introduced to address the limitations of (the ) which excels at handling “online” transactions but struggles with complex analytical queries, especially at a large scale. Designed specifically for data analysis, the multidimensional model is now a fundamental component of the Business Intelligence (BI) industry, enabling organizations to extract meaningful insights from vast datasets.
In a multidimensional model, data is organized as facts (quantifiable events) and dimensions (contextual descriptors). Facts typically have associated numerical values, known as measures, while dimensions are often textual, categorizing or qualifying the facts. Queries aggregate measure values over dimension ranges to provide actionable insights, such as "sales by month and city."
Components of a Multidimensional Data Model
Facts
Facts represent the core data points being analyzed. They usually represent business events, such as sales transactions, orders, or shipments. A fact is defined by its measurements (numerical values) and dimension values (contextual information). There are mainly two types of facts:
- Event Facts: they usually model events in the real world; meaning that there is exactly one fact for each unique event in the overall real-world process that is captures. For example with a fine granularity there will be one fact for each particular sale of a specific item; with a coarser granularity there would be one fact for each day when one or more sale of a specific item occurred If event facts did not have any measures they are called measurelss facts or factless facts, for example the record that represents a contact with a customer to inform him about a new offer.
- Snapshot Facts (state Facts): models the stats of a given process at a certain point in time.
Event facts are made to represent events that can happen unpredictably in time; however snapshot facts are typically made to represent the states at predefined times.
Dimensions
Dimensions are descriptive elements that provide context to facts. They categorize and describe the facts and are used for filtering and grouping data.
Dimensions are usually hierarchical, allowing users to drill down into finer details or roll up to see higher-level summaries. They are used for three purposes; the labeling, selection(filtering) and the grouping of data at a desired level of detail (they are often textual and considered as labels for data).
A dimension is organized into a containment-like hierarchy composed of number of levels, each of which represents a specific level of details. The instances of the dimensions are typically called dimension values or members; each value or member belongs to a particular level.
Measures
A measure has two components, a numerical property and a formula (usually an aggregation function such as sum or average). Measures generally represent the properties of a chosen fact. There are three classes of measures:
- Additive measures: can be combined meaningfully along any dimension (ex: sales over location and time) this causes no overlap between the real world phenomena that caused the individual values.
- Semi-additive measures: values cannot be combined along one or more of the dimensions (usually the time dimension); for example it doesn’t make sense to sum inventory over time as the same inventory item may be counted several times.
- Non-additive measures: Cannot be combined/summed along any dimension usually because of the chosen formula; for example when averages for lower level values cannot be combined with averages of higher level values.
Granularity
Granularity refers to the level of detail captured by the facts in a cube. The finer the granularity, the more detailed the data. For instance, sales data can be captured at different levels of granularity, such as per transaction, per day, or per month. The choice of granularity affects the level of detail in the analysis and the performance of the cube.
Cubes
A cube is a multidimensional data structure for capturing and analyzing data. It generalizes the tabular spreadsheet such as there can be any number of dimensions (in contrast to only two in the tabular spreadsheets). The terms hypercube and cube are used interchangeably as sometimes the term cube can infer that it hold only three dimensions. A collection of related cubes is commonly referred to as a multidimensional database or multidimensional data warehouse.
A cube consists of a set of cells at each of the dimensions intersections; a non empty cell is a fact. A fact has a number of measures associated with it. A sparse cube is one with high percentage of empty cells; a dense cube is one with high percentage of facts. Cubes tend to become increasingly sparse with increasing dimensionality and with increasingly finer granularities of the dimension value.
Operations in a Multidimensional Data Model
Multidimensional data models allow for a range of powerful operations that make it easier to explore data and extract insights:
- Rolling up: Rolling up causes the data view to go up to a higher cross grained view
- Drilling down and Drilling Out: The opposite of rolling up, the data view becomes more fine grained and detailed it offers more details by expanding one or more dimensions from a non-T level to a lower level. Drilling out occurs when a drill down is done by including an additional dimension at a non-t level, after a drill out the measure values are spread out among more cells.
- Slicing and Dicing: This happens when an analyst wishes to consider a subset of the cube, so he selects a specific value for a dimension; this is a slice. It is possible to slice the resulting slice further in what is called a dice. Slicing generally refer to filtering out data, and dicing refers to grouping out the filtered data.
- Drill Across: This is done when we do operation on more than one cube that share one or more conformed dimensions. The data from these cubes is combines by these shared dimensions, this in relational terms corresponds to a Full Outer-join.
- Pivot: allows an analyst to rotate the cube in space to see its various face
Multidimensional Representations
There are two principal ways of representing dimensions:
Star Schema
A star schema has one dimension table for each dimension, this table has a key column and one column for each level of the dimension. Furthermore, a star schema has a Fact Table that hold a row for each multidimensional fact, it has a column for each dimension. They primary key in the dimension tables is typically a surrogate key (ID), this will result in better storage, prevention of key-reuse problems and more efficient query processing. However, we notice that there is redundancy in higher level data, however it is not a problem space-wise and it does not cause any update problems thus it is often good idea to use such redundant dimensions tables to optimize performance.
Snowflake Schema
It is much similar to the star schema, however it contains several dimension tables for each dimension; this will result in removing the redundancy found in star schemas. The result of this is that querying the schema is now harder since several joins must be applied; it is also more time consuming to compute the results.
Data warehousing
A is a centralized repository that integrates data from various operational systems like ERP, CRM, and others. It is structured for the sole purpose of data analysis, unlike operational systems that focus on day-to-day transactions.
Subsets of a data warehouse are called , which focus on specific business functions (e.g., sales or marketing). Data is fed into the warehouse via the process:
- Extract: Data is gathered from diverse sources (e.g., operational databases).
- Transform: Data is cleaned, unified, and transformed into the warehouse format.
- Load: The transformed data is then loaded into the data warehouse.
ETL plays a pivotal role in ensuring data quality and consistency, enabling organizations to analyze integrated data efficiently. This process is sometimes referred to as ETT (Extract-Transform-Transport) when the final step focuses more on transporting data to different destinations.
Systems: Systems that provide fast answers to queries that aggregate large amounts of data to find overall trends; the results are presented in multidimensional fashion. As opposed to the well known the focus is on data analysis rather than transactions.Moreover, when doing OLTP the focus is on a single transaction whereas OLAP systems keep track of each individual transaction. OLAP systems generally never delete nor update its data; only additions of new data takes place periodically, thus OLAP systems are optimized for retrieving and summarizing large amounts of data. OLAP systems come mainly in three broad categories:
- : systems that are based on relational database management technology
- : systems that utilizes non-relational multidimensional technologies
- : a hybrid systems that combines these technologies