Dimensional modeling
{{Short description|Data modeling concept}}
{{Multiple issues|
{{Page numbers improve|date=June 2018}}
{{How-to|date=April 2025}}}}
{{Use dmy dates|date=July 2018}}
'''Dimensional modeling''' (DM) is part of the ''[[The Kimball Lifecycle|Business Dimensional Lifecycle]]'' methodology developed by [[Ralph Kimball]] which includes a set of methods, techniques and concepts for use in [[data warehouse]] design.<ref name="ConBegg9"/>{{rp|1258–1260}}<ref name="MoodyKokink-1">{{cite web|url=http://neumann.hec.ca/sites/cours/6-060-00/MK_entreprise.pdf|title=From Enterprise Models to Dimensional Models: A Methodology for Data Warehouse and Data Mart Design|id=Dimensional Modelling|access-date=3 July 2018|first1=Daniel L.|last1=Moody|first2=Mark A.R.|last2=Kortink|url-status=live|archive-url=https://web.archive.org/web/20170517164505/http://neumann.hec.ca/sites/cours/6-060-00/MK_entreprise.pdf|archive-date=17 May 2017|df=dmy-all}}</ref> The approach focuses on identifying the key [[business process]]es within a business and modeling and implementing these first before adding additional business processes, as a [[Top-down and bottom-up design|bottom-up approach]].<ref name="ConBegg9"/>{{rp|1258–1260}} An alternative approach from [[Bill Inmon|Inmon]] advocates a top-down design of the model of all the enterprise data using tools such as [[entity-relationship model]]ing (ER).<ref name="ConBegg9">{{cite book|title=Database Systems - A Practical Approach to Design, Implementation and Management|first1=Thomas|last1=Connolly|first2=Carolyn|last2=Begg|publisher=Pearson|isbn=978-1-292-06118-4|edition=6th|at=Part 9 Business Intelligence|date=26 September 2014|df=dmy-all}}</ref>{{rp|1258–1260}}
 
==Description==
 
Dimensional modeling always uses the concepts of facts (measures) and dimensions (context). Facts are typically (but not always) numeric values that can be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts. For example, sales amount is a fact; timestamp, product, register number, store number, etc. are elements of dimensions. Dimensional models are built by business process area, e.g. store sales, inventory, claims, etc. Because the different [[business process area]]s share some but not all dimensions, efficiency in design, operation, and consistency is achieved using [[Dimension (data warehouse)#Types|conformed dimensions]], i.e. using one copy of the shared dimension across subject areas.{{cn|date=July 2018}}
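For illustration, the retail sales example above could be expressed as a minimal [[star schema]]; the table and column names below are hypothetical and only sketch the pattern of a central fact table referencing descriptive dimension tables:

<syntaxhighlight lang="sql">
-- Minimal sketch of a retail-sales star schema (illustrative names only).
-- Facts (quantity, sales_amount) are numeric and additive; dimensions
-- (date, product) hold the descriptive context.
CREATE TABLE dim_date (
    date_key    INTEGER PRIMARY KEY,  -- surrogate key, e.g. 20180704
    full_date   DATE,
    month_name  VARCHAR(20),
    year_number INTEGER
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY, -- surrogate key
    product_name VARCHAR(100),
    brand        VARCHAR(50),
    category     VARCHAR(50)
);

CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date (date_key),
    product_key  INTEGER REFERENCES dim_product (product_key),
    store_key    INTEGER,             -- further dimensions elided
    quantity     INTEGER,             -- additive fact
    sales_amount DECIMAL(10,2)        -- additive fact
);
</syntaxhighlight>

If dim_date and dim_product were also shared, unchanged, with an inventory fact table, they would act as conformed dimensions in the sense described above.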
 
Dimensional modeling does not necessarily involve a relational database. The same modeling approach, at the logical level, can be used for any physical form, such as multidimensional database or even flat files. It is oriented around understandability and performance.{{cn|date=July 2018}}
== Design method ==
 
;Choose the business process:
The process of dimensional modeling builds on a four-step design method that helps to ensure the usability of the dimensional model and the use of the [[data warehouse]]. The design builds on the actual business process which the [[data warehouse]] should cover, so the first step in the model is to describe the business process on which the model builds. This could for instance be a sales situation in a retail store. The business process can be described in plain text or with notations such as [[Business Process Model and Notation]] (BPMN) or the [[Unified Modeling Language]] (UML).
 
;Declare the grain:
The grain of the model is the exact description of what the dimensional model should be focusing on, for instance an individual line item on a customer's retail sales receipt.

;Identify the dimensions:
The dimensions supply the descriptive context surrounding each fact, for instance the date, product, and store involved in a sale.

;Identify the facts:
After defining the dimensions, the next step in the process is to make keys for the fact table. This step identifies the numeric facts that will populate each fact table row. It is closely related to the business users of the system, since this is where they get access to data stored in the [[data warehouse]]. Therefore, most fact table rows contain numerical, additive figures such as quantity or cost per unit.
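As a sketch of why additivity matters, the hypothetical schema above allows facts to be summed across any combination of dimension attributes; the query below (names illustrative) rolls sales up by brand and month:

<syntaxhighlight lang="sql">
-- Additive facts can be summed across any dimensional grouping;
-- here, total quantity and sales amount by year, month and brand.
SELECT d.year_number,
       d.month_name,
       p.brand,
       SUM(f.quantity)     AS total_quantity,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales AS f
JOIN dim_date    AS d ON f.date_key    = d.date_key
JOIN dim_product AS p ON f.product_key = p.product_key
GROUP BY d.year_number, d.month_name, p.brand;
</syntaxhighlight>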
 
=== Dimension normalization ===
{{NPOV section|date=June 2018}}
Dimensional normalization, or snowflaking, removes redundant attributes that would otherwise be kept in flattened, de-normalized dimensions. Dimensions become strictly joined together in sub-dimensions.
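As a hypothetical sketch, snowflaking the product dimension from the earlier example moves the repeated brand and category attributes into their own strictly joined sub-dimension tables:

<syntaxhighlight lang="sql">
-- Snowflaked (normalized) version of the product dimension: the brand
-- and category attributes become sub-dimension tables joined through
-- foreign keys (illustrative names; replaces the flat dim_product above).
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE dim_brand (
    brand_key    INTEGER PRIMARY KEY,
    brand_name   VARCHAR(50),
    category_key INTEGER REFERENCES dim_category (category_key)
);

CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    brand_key    INTEGER REFERENCES dim_brand (brand_key)
);
</syntaxhighlight>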
 
== Benefits of dimensional modeling ==
'''Commonly cited benefits of dimensional modeling include:'''<ref name="kimball2013">{{cite book |last1=Kimball |first1=Ralph |last2=Ross |first2=Margy |title=The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling |edition=3rd |year=2013 |publisher=Wiley |isbn=9781118530801 |page=43 |url=https://ia801609.us.archive.org/14/items/the-data-warehouse-toolkit-kimball/The%20Data%20Warehouse%20Toolkit%20-%20Kimball.pdf}}</ref>
* '''Understandability and simplicity.''' Dimensional models organize data by business processes and shared business terms (dimensions), which makes schemas easier for analysts to navigate than highly normalized designs.<ref name="kimball2013" />
* '''Query performance for analytic workloads.''' Star-schema queries typically join a large fact table to a few small dimensions; many systems implement star-join optimizations, and benchmarks specifically evaluate this workload (e.g., the Star Schema Benchmark).<ref name="kimball2013" /><ref name="ssb2009">{{cite conference |last1=O'Neil |first1=Patrick |last2=O'Neil |first2=Elizabeth |last3=Chen |first3=Xuedong |last4=Revilak |first4=Stephen |title=The Star Schema Benchmark and Augmented Fact Table Indexing |book-title=Performance Evaluation and Benchmarking (TPCTC 2009) |year=2009 |publisher=Springer |doi=10.1007/978-3-642-10424-4_17 |url=https://link.springer.com/chapter/10.1007/978-3-642-10424-4_17|url-access=subscription }}</ref>
* '''Extensibility (resilience to change).''' New facts or dimensions can be added without breaking existing queries so long as the fact-table grain is preserved; this allows incremental evolution of the warehouse.<ref name="kimball2013" />
* '''Integration and consistency across subject areas.''' Reusable '''conformed dimensions''' enable consistent cross-process analysis and reduce duplication in future projects.<ref name="bus">{{cite web |title=Enterprise Data Warehouse Bus Architecture |website=Kimball Group |url=https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/kimball-data-warehouse-bus-architecture/ |access-date=2025-08-15}}</ref><ref name="conformed">{{cite web |title=Conformed Dimensions |website=Kimball Group |url=https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/conformed-dimension/ |access-date=2025-08-15}}</ref>
* '''Support for time-variant analysis.''' Techniques for '''slowly changing dimensions''' record attribute history so that analyses reflect the state of a dimension member at the time of each fact (a minimal Type 2 sketch follows this list).<ref name="scd">{{cite web |title=Slowly Changing Dimensions |website=Kimball Group |date=2008-08-07 |url=https://www.kimballgroup.com/2008/08/slowly-changing-dimensions/ |access-date=2025-08-15}}</ref>
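A common way to implement the Type 2 variant of a slowly changing dimension is to version rows with effective dates and a current-row flag. The sketch below is illustrative; the column and key names are hypothetical, not taken from the cited sources:

<syntaxhighlight lang="sql">
-- Type 2 slowly changing dimension: a change inserts a new row version
-- instead of overwriting history (illustrative names only).
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY, -- surrogate key, one per row version
    customer_id   VARCHAR(20),         -- natural (business) key
    city          VARCHAR(50),
    row_effective DATE,
    row_expiry    DATE,
    is_current    BOOLEAN
);

-- When customer C042 moves, expire the current row ...
UPDATE dim_customer
SET row_expiry = DATE '2018-06-30',
    is_current = FALSE
WHERE customer_id = 'C042'
  AND is_current = TRUE;

-- ... and insert a new version; facts recorded after 1 July 2018
-- reference surrogate key 1001 and thus the new city.
INSERT INTO dim_customer
VALUES (1001, 'C042', 'Copenhagen', DATE '2018-07-01', DATE '9999-12-31', TRUE);
</syntaxhighlight>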
 
== Dimensional models, Hadoop, and big data ==
We still get the benefits of dimensional models on [[Apache Hadoop|Hadoop]] and similar [[big data]] frameworks. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modeling.{{cn|date=May 2019}}
* The [[Apache Hadoop#HDFS|Hadoop File System]] is [[Immutable object|immutable]]. We can only add but not update data. As a result we can only append records to dimension tables, and [[Slowly changing dimension|slowly changing dimensions]] on Hadoop become the default behavior. In order to get the latest and most up-to-date record in a dimension table we have three options. First, we can create a [[View (SQL)|view]] that retrieves the latest record using [[Select (SQL)#Window function|windowing functions]]; a sketch of this option follows the list. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase, and federate queries across the two types of storage.
* The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database ([[Massively parallel|MPP]]) we can co-locate records with the same primary and foreign keys on the same node in a cluster, which makes it relatively cheap to join very large tables: no data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS, where tables are split into big chunks and distributed across the nodes of the cluster, and we have no control over how individual records and their keys are spread. As a result, joins on Hadoop between two very large tables are quite expensive, because data has to travel across the network, and should be avoided where possible. For a large fact and dimension table we can de-normalize the dimension table directly into the fact table. For two very large transaction tables we can nest the records of the child table inside the parent table and flatten out the data at run time.
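The first option above, a view that keeps only the latest version of each record, might be sketched as follows (table and column names are hypothetical; the windowing syntax works both in standard SQL and in SQL-on-Hadoop engines such as Hive):

<syntaxhighlight lang="sql">
-- On append-only storage every change adds a row, so a view with
-- ROW_NUMBER() exposes only the newest version of each dimension record.
CREATE VIEW dim_customer_current AS
SELECT customer_id, city, load_timestamp
FROM (
    SELECT customer_id,
           city,
           load_timestamp,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY load_timestamp DESC) AS rn
    FROM dim_customer_appendonly
) versions
WHERE rn = 1;
</syntaxhighlight>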
 
== Literature ==
|title = Identifying Business Processes
|year = 2005
|issue = 69
|author = Margy Ross (Kimball Group)
|url = http://www.kimballgroup.com/2005/07/05/design-tip-69-identifying-business-processes/
|journal = Kimball Group, Design Tips
|url-status = dead
|archive-url = https://web.archive.org/web/20130612211852/http://www.kimballgroup.com/2005/07/05/design-tip-69-identifying-business-processes/
|archive-date = 2013-06-12
}}
 
 
<ref name="refname4">{{cite book|author1=Ralph Kimball |author2=Margy Ross |title=The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling|edition=Second |date=April 26, 2002|publisher=Wiley|ISBN=0-471-20024-7}}</ref>
<!-- <ref name="refname5">{{cite book|author1=Ralph Kimball |author2=Margy Ross |author3=Warren Thornthwaite |author4=Joy Mundy |author5=Bob Becker |title=The Data Warehouse Lifecycle Toolkit |edition=Second |date=January 2008|publisher=Wiley|ISBN= 978-0-470-14977-5}}</ref> -->
}}