Data vault modeling

== History and philosophy ==
{{Original research|discuss=Template talk:Original research#discuss parameter|date=August 2019}}
 
In its early days, Dan Linstedt referred to the modeling technique which was to become data vault as ''common foundational warehouse architecture''<ref>Building a scalable datawarehouse with data vault 2.0, p. 11</ref> or ''common foundational modeling architecture''.<ref>Building a scalable datawarehouse with data vault 2.0, p. xv</ref> In [[data warehouse]] modeling there are two well-known competing options for modeling the layer where the data are stored: modeling according to [[Ralph Kimball]], with conformed dimensions and an [[Enterprise bus matrix|enterprise data bus]], or modeling according to [[Bill Inmon]], with the database [[normal forms|normalized]].<ref>{{Cite web |date=2020-02-03 |title=Data Warehouse Concepts: Kimball vs. Inmon Approach |url=https://www.astera.com/type/blog/data-warehouse-concepts/ |access-date=2024-10-02 |website=Astera |language=en-US}}</ref> Both techniques have issues when dealing with changes in the systems feeding the data warehouse.{{Citation needed|date=August 2019}} Conformed dimensions also require data to be cleansed (to conform it), which is undesirable in a number of cases since it inevitably loses information.{{Citation needed|date=August 2019}} Data vault is designed to avoid or minimize the impact of those issues, by moving them to areas of the data warehouse that are outside the historical storage area (cleansing is done in the data marts) and by separating the structural items (business keys and the associations between the business keys) from the descriptive attributes.
 
Dan Linstedt, the creator of the method, describes the resulting database as follows:
Another issue to which data vault responds is the growing need for complete auditability and traceability of all the data in the data warehouse. Due to [[Sarbanes-Oxley]] requirements in the United States and similar measures in Europe, this is a relevant topic for many business intelligence implementations; hence, the focus of any data vault implementation is complete traceability and auditability of all information.
 
''Data Vault 2.0'' is the new specification. It is an [[open standard]].<ref>[[#dvos2|A short intro to #datavault 2.0]]</ref> The specification consists of three pillars: the methodology ([[Software Engineering Institute|SEI]]/[[Capability Maturity Model|CMMI]], [[Six Sigma]], [[Systems development life cycle|SDLC]], etc.), the architecture (including, among other things, an input layer (data stage, called the [[persistent staging area]] in Data Vault 2.0), a presentation layer (data mart), and the handling of data quality services and master data services), and the model. Within the methodology, the implementation of best practices is defined. Data Vault 2.0 focuses on including new components such as [[big data]] and [[NoSQL]], and also on the performance of the existing model. The old specification (documented here for the most part) is highly focused on data vault modeling. It is documented in the book ''Building a Scalable Data Warehouse with Data Vault 2.0''.<ref>{{Cite web |title=Building a Scalable Data Warehouse with Data Vault 2.0 [Book] |url=https://www.oreilly.com/library/view/building-a-scalable/9780128026489/ |access-date=2024-10-02 |website=www.oreilly.com |language=en}}</ref>
 
The specification needs to evolve to include these new components, along with best practices, in order to keep the EDW and BI systems current with the needs and desires of today's businesses.
 
=== History ===
Data vault modeling was originally conceived by Dan Linstedt in the 1990s and was released in 2000 as a public ___domain modeling method. In a series of five articles in ''The Data Administration Newsletter'', the basic rules of the Data Vault method are expanded and explained. These contain a general overview,<ref>{{Cite web |date=2002-07-01 |title=Data Vault Series 1 – Data Vault Overview |url=https://tdan.com/data-vault-series-1-data-vault-overview/5054 |access-date=2024-10-02 |website=TDAN.com |language=en-US}}</ref> an overview of the components,<ref>{{Cite web |date=2003-01-01 |title=Data Vault Series 2 – Data Vault Components |url=https://tdan.com/data-vault-series-2-data-vault-components/5155 |access-date=2024-10-02 |website=TDAN.com |language=en-US}}</ref> a discussion about end dates and joins,<ref>[[#tdan3|Data Vault Series 3 – End Dates and Basic Joins]]</ref> link tables,<ref>[[#tdan4|Data Vault Series 4 – Link tables]], paragraph 2.3</ref> and an article on loading practices.<ref name="DataVault_a">[[#tdan5|Data Vault Series 5 – Loading Practices]]</ref>
 
An alternative (and seldom used) name for the method is "Common Foundational Integration Modelling Architecture."<ref>[[#dwdummy|Data Warehousing for Dummies]], page 83</ref>
 
== Basic notions ==
Data Vault 2.0 organizes data into three core components that separate stable identifiers from changing descriptive attributes.<ref name="linstedt2015">{{cite book |last1=Linstedt |first1=Daniel |last2=Olschimke |first2=Michael |title=Building a Scalable Data Warehouse with Data Vault 2.0 |publisher=Morgan Kaufmann |year=2015 |isbn=9780128025109 |url=https://books.google.com/books/about/Building_a_Scalable_Data_Warehouse_with.html?id=lgDJBAAAQBAJ}}</ref> Data vault attempts to solve the problem of dealing with change in the environment by separating the business keys (which do not mutate as often, because they uniquely identify a business entity) and the associations between those business keys from the descriptive attributes of those keys. The three components are (see also the schematic sketch below):

* '''Hub''' – stores a unique business key for a core business concept together with minimal metadata for lineage/audit; it acts as an integration point across sources.<ref name="linstedt2015" />
* '''Link''' – captures the relationship (often many-to-many) between hubs; the participating hub keys define the grain of the relationship.<ref name="linstedt2015" />
* '''Satellite''' – contains descriptive attributes and their history associated with a hub or link; satellites are append-only so every change is preserved (similar in effect to Type-II history in dimensional models).<ref name="linstedt2015" />

The business keys and their associations are structural attributes, forming the skeleton of the data model. One of the main axioms of the data vault method is that real business keys only change when the business changes and are therefore the most stable elements from which to derive the structure of a historical database. If these keys are used as the backbone of a data warehouse, the rest of the data can be organized around them, so choosing the correct keys for the hubs is of prime importance for the stability of the model.<ref>[[#dvsuper|Super Charge your data warehouse]], page 61, why are business keys important</ref> The keys are stored in tables with a few constraints on the structure. These key tables are called hubs.
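
The following is a minimal, illustrative sketch of how hubs, a link, and a satellite might be laid out as tables, using Python's built-in <code>sqlite3</code> module. All table and column names (such as <code>H_CUSTOMER</code> or <code>S_CUSTOMER</code>) are hypothetical examples chosen for this sketch, not names prescribed by the Data Vault standard.

<syntaxhighlight lang="python">
# Illustrative sketch only: a minimal hub/link/satellite layout in SQLite.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
now = datetime.now(timezone.utc).isoformat()

conn.executescript("""
-- Hub: one row per unique business key, plus load metadata for auditability.
CREATE TABLE H_CUSTOMER (
    CUSTOMER_HK   TEXT PRIMARY KEY,      -- surrogate (hash) key
    CUSTOMER_BK   TEXT NOT NULL UNIQUE,  -- business key
    LOAD_DTS      TEXT NOT NULL,
    RECORD_SOURCE TEXT NOT NULL
);
CREATE TABLE H_ORDER (
    ORDER_HK      TEXT PRIMARY KEY,
    ORDER_BK      TEXT NOT NULL UNIQUE,
    LOAD_DTS      TEXT NOT NULL,
    RECORD_SOURCE TEXT NOT NULL
);

-- Link: the combination of hub keys defines the grain of the relationship.
CREATE TABLE L_CUSTOMER_ORDER (
    LINK_HK       TEXT PRIMARY KEY,
    CUSTOMER_HK   TEXT NOT NULL REFERENCES H_CUSTOMER(CUSTOMER_HK),
    ORDER_HK      TEXT NOT NULL REFERENCES H_ORDER(ORDER_HK),
    LOAD_DTS      TEXT NOT NULL,
    RECORD_SOURCE TEXT NOT NULL
);

-- Satellite: descriptive attributes; history is kept by appending new rows.
CREATE TABLE S_CUSTOMER (
    CUSTOMER_HK   TEXT NOT NULL REFERENCES H_CUSTOMER(CUSTOMER_HK),
    LOAD_DTS      TEXT NOT NULL,
    NAME          TEXT,
    EMAIL         TEXT,
    RECORD_SOURCE TEXT NOT NULL,
    PRIMARY KEY (CUSTOMER_HK, LOAD_DTS)
);
""")

# A change in a descriptive attribute is recorded as a new satellite row;
# existing rows are never updated or deleted.
conn.execute("INSERT INTO H_CUSTOMER VALUES (?, ?, ?, ?)",
             ("hk-1", "CUST-1001", now, "crm"))
conn.execute("INSERT INTO S_CUSTOMER VALUES (?, ?, ?, ?, ?)",
             ("hk-1", now, "Ada Lovelace", "ada@example.org", "crm"))
conn.commit()
</syntaxhighlight>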
 
Specialized satellites support temporal semantics. For example, an '''effectivity satellite''' on a link records begin/end dates representing when the relationship is considered effective by the business.<ref name="linstedt2015" />
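
As a sketch of this idea, and continuing the hypothetical SQLite example above, an effectivity satellite on the customer–order link could be laid out as follows; the table and column names are illustrative assumptions, not prescribed by the method.

<syntaxhighlight lang="python">
# Effectivity satellite on the (hypothetical) customer-order link: each row
# records the period during which the business considers the relationship
# effective. Closing or re-opening a relationship appends a new row.
conn.executescript("""
CREATE TABLE S_CUSTOMER_ORDER_EFF (
    LINK_HK        TEXT NOT NULL REFERENCES L_CUSTOMER_ORDER(LINK_HK),
    LOAD_DTS       TEXT NOT NULL,
    EFFECTIVE_FROM TEXT NOT NULL,
    EFFECTIVE_TO   TEXT,              -- NULL while the relationship is open
    RECORD_SOURCE  TEXT NOT NULL,
    PRIMARY KEY (LINK_HK, LOAD_DTS)
);
""")
</syntaxhighlight>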
 
=== Layers ===
* '''Raw Vault''' – a source-driven integration layer that retains granular, auditable history with minimal transformations.<ref name="linstedt2015" />
* '''Business Vault''' – a derived layer that applies business rules and query-assistance structures (e.g., PIT and bridge tables) to facilitate downstream consumption (a query sketch follows below).<ref name="linstedt2015" />
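
The following is a minimal sketch, continuing the hypothetical SQLite example above, of the kind of lookup that a point-in-time (PIT) table pre-computes: for each hub key, find the satellite row that was current as of a chosen snapshot timestamp. It is illustrative only, not a prescribed implementation.

<syntaxhighlight lang="python">
# Point-in-time style lookup: for every customer, pick the most recent
# satellite row loaded on or before the snapshot timestamp. A PIT table
# simply pre-computes and stores these lookups for a set of snapshot dates.
snapshot = now  # reuse the timestamp from the sketch above; any ISO timestamp works
rows = conn.execute("""
    SELECT h.CUSTOMER_BK, s.NAME, s.EMAIL, s.LOAD_DTS
    FROM H_CUSTOMER h
    JOIN S_CUSTOMER s
      ON s.CUSTOMER_HK = h.CUSTOMER_HK
    WHERE s.LOAD_DTS = (
        SELECT MAX(s2.LOAD_DTS)
        FROM S_CUSTOMER s2
        WHERE s2.CUSTOMER_HK = h.CUSTOMER_HK
          AND s2.LOAD_DTS <= ?
    )
""", (snapshot,)).fetchall()
print(rows)  # e.g. [('CUST-1001', 'Ada Lovelace', 'ada@example.org', ...)]
</syntaxhighlight>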
 
=== Use with dimensional models ===
In practice, Data Vault commonly serves as the historical integration layer, while star-schema information marts are projected from the Raw/Business Vault for performant analytics and simpler user access.<ref name="hultgren">{{cite book |last=Hultgren |first=Hans |title=Modeling the Agile Data Warehouse with Data Vault |year=2012 |publisher=Brighton Hamilton |isbn=9780615723082 |url=https://searchworks.stanford.edu/view/11487017}}</ref><ref name="kimball2013">{{cite web |title=The Data Warehouse Toolkit, 3rd Edition |website=Wiley |url=https://www.wiley.com/en-us/The%2BData%2BWarehous}}</ref>
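
As an illustration of this pattern, and continuing the hypothetical SQLite example above, a simple customer dimension can be projected as a view over the hub and its satellite. A real information mart would additionally derive surrogate keys, end dates, and current-row flags according to local conventions; the names below are assumptions made for this sketch.

<syntaxhighlight lang="python">
# Projecting a simple, history-keeping customer dimension from the raw vault.
# Each satellite row becomes one dimension row, valid from its load timestamp.
conn.executescript("""
CREATE VIEW DIM_CUSTOMER AS
SELECT
    h.CUSTOMER_BK AS customer_id,
    s.NAME        AS customer_name,
    s.EMAIL       AS customer_email,
    s.LOAD_DTS    AS valid_from
FROM H_CUSTOMER h
JOIN S_CUSTOMER s ON s.CUSTOMER_HK = h.CUSTOMER_HK;
""")
print(conn.execute("SELECT * FROM DIM_CUSTOMER").fetchall())
</syntaxhighlight>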
 
=== Hubs ===
== Tools ==
Some examples of tools are:{{clarify|date=April 2022}}
* [https://www.datavault4dbt.com DataVault4dbt]
* [https://datavault-builder.com 2150 Datavault Builder]
* [https://www.astera.com/products/data-warehouse-builder/ Astera DW Builder]