Dimensional modeling

{{NPOV section|date=June 2018}}
We still get the benefits of dimensional models on [[Apache Hadoop|Hadoop]] and similar big data frameworks. However, some characteristics of Hadoop require us to slightly adapt the standard approach to dimensional modeling.<ref>{{Cite web|url=https://sonra.io/2017/05/15/dimensional-modeling-and-kimball-data-marts-in-the-age-of-big-data-and-hadoop/|title=Dimensional Modeling and Kimball Data Marts in the Age of Big Data and Hadoop|last=Bethke|first=Uli|date=2017-05-15|website=Sonra|access-date=2017-05-22}}</ref>
* The [[Apache Hadoop#HDFS|Hadoop Distributed File System (HDFS)]] is [[Immutable object|immutable]]: we can only append data, not update it in place. As a result we can only append records to dimension tables, and [[Slowly changing dimension|slowly changing dimensions]] become the default behavior on Hadoop. To get the latest, most up-to-date record in a dimension table we have three options. First, we can create a [[View (SQL)|view]] that retrieves the latest record using [[Select (SQL)#Window function|window functions]] (see the sketch after this list). Second, we can run a compaction service in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. [[Apache HBase|HBase]], and federate queries across the two types of storage.
* The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database ([[Massively parallel|MPP]]) we can co-locate records with the same primary and foreign keys on the same node in the cluster, which makes it relatively cheap to join very large tables: no data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS, where tables are split into big chunks and distributed across the nodes of the cluster, and we have no control over how individual records and their keys are spread across the cluster. As a result, joins between two very large tables on Hadoop are quite expensive because data has to travel across the network, so we should avoid joins where possible. For a large fact table and a large dimension table we can denormalize the dimension table directly into the fact table (as sketched in the second example below). For two very large transaction tables we can nest the records of the child table inside the parent table and flatten out the data at run time.
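A minimal sketch of the first option, assuming a hypothetical append-only dimension table <code>dim_customer</code> with a natural key <code>customer_id</code> and a load timestamp <code>load_ts</code> (Hive/Spark SQL syntax; the table and column names are illustrative, not from a real schema):

<syntaxhighlight lang="sql">
-- View that exposes only the latest version of each customer record
-- in an append-only dimension table.
CREATE VIEW dim_customer_current AS
SELECT customer_id, name, address, load_ts
FROM (
    SELECT customer_id, name, address, load_ts,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY load_ts DESC) AS rn
    FROM dim_customer
) versioned
WHERE rn = 1;
</syntaxhighlight>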
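Similarly, a sketch of denormalizing a large dimension table into a fact table so that queries no longer need a join at run time, again with illustrative table and column names:

<syntaxhighlight lang="sql">
-- Rebuild the fact table with the needed dimension attributes copied in,
-- so queries avoid joining fact_sales to dim_product across the network.
CREATE TABLE fact_sales_denorm AS
SELECT f.sale_id,
       f.sale_ts,
       f.quantity,
       f.amount,
       d.product_name,
       d.category,
       d.brand
FROM fact_sales f
JOIN dim_product d
  ON f.product_key = d.product_key;
</syntaxhighlight>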
 
== Literature ==