{{NPOV section|date=June 2018}}
Dimensional models retain their benefits on [[Apache Hadoop|Hadoop]] and similar big data frameworks, but some characteristics of Hadoop require the standard approach to dimensional modelling to be adapted.<ref>{{Cite web|url=https://sonra.io/2017/05/15/dimensional-modeling-and-kimball-data-marts-in-the-age-of-big-data-and-hadoop/|title=Dimensional Modeling and Kimball Data Marts in the Age of Big Data and Hadoop|last=Uli|first=Bethke|date=2017-05-15|website=Sonra|access-date=2017-05-22}}</ref>
* The [[Apache Hadoop#HDFS|Hadoop Distributed File System (HDFS)]] is [[Immutable object|immutable]]: data can be appended but not updated in place. Records can therefore only be appended to dimension tables, and [[Slowly changing dimension|slowly changing dimensions]] become the default behaviour on Hadoop (see the first sketch after this list).
* The way data is distributed across HDFS makes joins expensive. In a distributed relational database ([[Massively parallel|MPP]]), records with the same primary and foreign keys can be co-located on the same node of the cluster, so very large tables can be joined relatively cheaply because no data has to travel across the network. On HDFS, by contrast, tables are split into large chunks that are distributed across the nodes of the cluster, with no control over how individual records and their keys are placed. Joining two very large tables is therefore expensive, as data must be shuffled across the network, and joins should be avoided where possible. For a large fact table and a dimension table, the dimension can instead be denormalised directly into the fact table (see the second sketch after this list).
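For illustration, the following sketch (not part of the cited source) shows one way a slowly changing dimension can be handled on an immutable file system using [[Apache Spark]]'s Python API; the table names, column names and HDFS paths are hypothetical. Rather than updating a changed record in place, a new version of the row is appended, and the current version is selected at query time:
<syntaxhighlight lang="python">
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("scd_append").getOrCreate()

# Changed or new customer rows arriving from the source system (hypothetical path).
updates = spark.read.parquet("hdfs:///staging/customer_updates")

# HDFS files are immutable, so changed rows are appended as new versions with a
# load timestamp instead of being updated in place (Type 2 behaviour by default).
(updates
    .withColumn("valid_from", F.current_timestamp())
    .write.mode("append")
    .parquet("hdfs:///warehouse/dim_customer"))

# At query time the current version of each customer is the newest row per key.
dim_customer = spark.read.parquet("hdfs:///warehouse/dim_customer")
latest_first = Window.partitionBy("customer_id").orderBy(F.col("valid_from").desc())
current_dim = (dim_customer
    .withColumn("row_num", F.row_number().over(latest_first))
    .filter(F.col("row_num") == 1)
    .drop("row_num"))
</syntaxhighlight>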
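Similarly, the following hypothetical sketch avoids a repeated distributed join by denormalising a dimension table into a large fact table once at load time; again the names and paths are illustrative only. Broadcasting the small dimension means the large fact table is not shuffled across the network:
<syntaxhighlight lang="python">
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("denormalise_fact").getOrCreate()

# Large fact table and a much smaller dimension table stored on HDFS.
fact_sales = spark.read.parquet("hdfs:///warehouse/fact_sales")
dim_product = spark.read.parquet("hdfs:///warehouse/dim_product")

# Resolve the join once at load time: broadcast() ships the small dimension to
# every node, so only the dimension travels across the network.
fact_sales_wide = fact_sales.join(broadcast(dim_product),
                                  on="product_key", how="left")

# Persist the widened, denormalised fact table for downstream queries.
fact_sales_wide.write.mode("overwrite").parquet(
    "hdfs:///warehouse/fact_sales_denormalised")
</syntaxhighlight>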
== Literature ==