Content deleted Content added
m Open access bot: url-access=subscription updated in citation with #oabot. |
|||
(12 intermediate revisions by 10 users not shown) | |||
Line 1:
{{Short description|Data modeling concept}}
{{Multiple issues|
{{Page numbers improve|date=June 2018}}
{{How-to|date=April 2025}}}}
{{Use dmy dates|date=July 2018}}
'''Dimensional modeling'''
==Description==
Line 35 ⟶ 38:
=== Dimension normalization ===
Dimensional normalization or snowflaking removes redundant attributes, which are known in the normal flatten de-normalized dimensions. Dimensions are strictly joined together in sub dimensions.
Line 52 ⟶ 54:
== Benefits of dimensional modeling ==
'''Commonly cited benefits of dimensional modeling include:'''<ref name="kimball2013">{{cite book |last1=Kimball |first1=Ralph |last2=Ross |first2=Margy |title=The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling |edition=3rd |year=2013 |publisher=Wiley |isbn=9781118530801 |page=43 |url=https://ia801609.us.archive.org/14/items/the-data-warehouse-toolkit-kimball/The%20Data%20Warehouse%20Toolkit%20-%20Kimball.pdf}}</ref>
* '''Understandability and simplicity.''' Dimensional models organize data by business processes and shared business terms (dimensions), which makes schemas easier for analysts to navigate than highly normalized designs.<ref name="kimball2013" />
* '''Query performance for analytic workloads.''' Star-schema queries typically join a large fact table to a few small dimensions; many systems implement star-join optimizations, and benchmarks specifically evaluate this workload (e.g., the Star Schema Benchmark).<ref name="kimball2013" /><ref name="ssb2009">{{cite conference |last1=O'Neil |first1=Patrick |last2=O'Neil |first2=Elizabeth |last3=Chen |first3=Xuedong |last4=Revilak |first4=Stephen |title=The Star Schema Benchmark and Augmented Fact Table Indexing |book-title=Performance Evaluation and Benchmarking (TPCTC 2009) |year=2009 |publisher=Springer |doi=10.1007/978-3-642-10424-4_17 |url=https://link.springer.com/chapter/10.1007/978-3-642-10424-4_17|url-access=subscription }}</ref>
* '''Extensibility (resilience to change).''' New facts or dimensions can be added without breaking existing queries so long as the fact-table grain is preserved; this allows incremental evolution of the warehouse.<ref name="kimball2013" />
* '''Integration and consistency across subject areas.''' Reusable '''conformed dimensions''' enable consistent cross-process analysis and reduce duplication in future projects.<ref name="bus">{{cite web |title=Enterprise Data Warehouse Bus Architecture |website=Kimball Group |url=https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/kimball-data-warehouse-bus-architecture/ |access-date=2025-08-15}}</ref><ref name="conformed">{{cite web |title=Conformed Dimensions |website=Kimball Group |url=https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/conformed-dimension/ |access-date=2025-08-15}}</ref>
* '''Support for time-variant analysis.''' Techniques for '''slowly changing dimensions''' record attribute history so that analyses reflect the state of a dimension member at the time of each fact.<ref name="scd">{{cite web |title=Slowly Changing Dimensions |website=Kimball Group |date=2008-08-07 |url=https://www.kimballgroup.com/2008/08/slowly-changing-dimensions/ |access-date=2025-08-15}}</ref>
== Dimensional models, Hadoop, and big data ==
We still get the benefits of dimensional models on [[Apache Hadoop|Hadoop]] and similar [[big data]] frameworks. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modelling.{{cn|date=May 2019}}
* The [[Apache Hadoop#HDFS|Hadoop File System]] is [[Immutable object|immutable]]. We can only add but not update data. As a result we can only append records to dimension tables. [[Slowly changing dimension|Slowly Changing Dimensions]] on Hadoop become the default behavior. In order to get the latest and most up to date record in a dimension table we have three options. First, we can create a [[View (SQL)|View]] that retrieves the latest record using [[Select (SQL)#Window function|windowing functions]]. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase and federate queries across the two types of storage.
Line 94 ⟶ 101:
<ref name="refname4">{{cite book|author1=Ralph Kimball |author2=Margy Ross |title=The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling|edition=Second |date=April 26, 2002|publisher=Wiley|ISBN=0-471-20024-7}}</ref>
<!-- <ref name="refname5">{{cite book|author1=Ralph Kimball |author2=Margy Ross |author3=Warren Thornthwaite |author4=Joy Mundy |author5=Bob Becker |title=The Data Warehouse Lifecycle Toolkit |edition=Second |date=January 2008|publisher=Wiley|ISBN= 978-0-470-14977-5}}</ref> -->
}}
|