Lately, I've been diving deep into topics like Data Warehouses, Data Lakes, Data Modeling, Wide Tables, and the list goes on and on.
I've been curious about the history of dimensional modeling and why it still finds relevance in certain contexts.
While there are debates about the relevance of dimensional modeling in today's data landscape, I don't necessarily agree with the notion that it's no longer valuable.
Certainly, cloud computing has transformed the landscape, and modern computers have faster processing speeds, but there are still scenarios where I've found dimensional modeling to be highly beneficial. For instance, while working with reporting data in PowerBI, I've observed that querying is faster with a dimensional model compared to using a single, wide table (special thanks to Datacamp for teaching me how to measure this!). If you're interested in exploring dimensional models and their advantages in PowerBI, I recommend checking out this article.
What is Dimensional Modelling?
At its most basic form, Dimensional Modeling involves breaking down one large table into multiple tables, known as fact and dimension tables.
A typical fact table contains facts and dimension table keys for the purpose of referential integrity.
A typical dimension table contains the dimensions and a dimension key, which are used as references in the fact table.
What are facts?
For a long time, I was genuinely curious about what constitutes a fact and whether it is left to the modeler to decide what qualifies as a fact or a dimension.
After researching articles and books, I have finally arrived at a definition that works well for me, and I hope it will work well for you too:
Facts are the attributes in a table that are numeric and additive. These attributes can also change continuously and are not fixed.
In the Christmas Flight facts table above, the facts consist of ticket sales and ticket units (the number of tickets sold) within a single transaction. These facts are numeric and additive.
However, there are situations in which facts are semi-additive or non-additive. Exploring these cases is beyond the scope of this article, but I recommend this book to learn more about them.
What are dimensions?
Dimensions are the attributes of a table that provide textual context. They remain constant, are discrete, and are drawn from a limited list. They describe the 'who, what, where, when, how, and why' associated with a transaction.
One helpful tip that has guided me is to examine data in a table that is discrete and cannot be summed. This is a cue that the data is likely to be a dimension.
In the table above, the customer key acts as the unique identifier (primary key) for each row in the dimension table. It is used to associate the dimension table with the fact table and provide referential integrity.
The first name and last name provide textual context for each transaction. Another usefulness of dimensions is that they can be used for grouping and filtering.
In this case, the data can be filtered with the customer names or the flight transactions can be grouped by each customer's name.
Getting familiar with facts and dimensions is a great starting point for diving into dimensional modeling and exploring various schemas, such as the star and snowflake schemas, which differ based on their degrees of normalization. At the core, these schemas are built upon the concepts of facts and dimensions. I hope this article helps kickstart your dimensional modeling journey!