Content deleted Content added
m Substing templates: {{Kilde www}}. See User:AnomieBOT/docs/TemplateSubster for info. |
m →Basic notions: heading & description list markup; Fixed incorrect references formatting – You can help! |
||
(47 intermediate revisions by 25 users not shown) | |||
Line 1:
{{Short description|Database modeling method}}
{{Refimprove|date=November 2016}}
[[File:Data Vault Example.png|thumb|upright=1.5|Simple data vault model with two hubs (blue), one link (green) and four satellites (yellow)]]
'''
Data vault modeling makes no distinction between good and bad data ("bad" meaning not conforming to business rules).<ref>[[#dvsuper|Super Charge your data warehouse]], page 74</ref> This is summarized in the statement that a data vault stores "[[Single source of truth|a single version of the facts]]" (also expressed by [[Dan Linstedt]] as "all the data, all of the time") as opposed to the practice in other data warehouse methods of storing "a [[single version of the truth]]"<ref>[[#rdamhof1|The next generation EDW]]</ref> where data that does not conform to the definitions is removed or "cleansed". A data vault enterprise data warehouse provides both; a single version of facts and a single source of truth.<ref>Building a scalable datawarehouse with data vault 2.0, p. 6</ref>
The modeling method is designed to be resilient to change in the business environment where the data being stored is coming from, by explicitly separating [[Data structure|structural information]] from descriptive [[Attribute (computing)|attributes]].<ref>[[#dvsuper|Super Charge your data warehouse]], page 21</ref> Data vault is designed to enable [[Parallel computing|parallel]] loading as much as possible,<ref>[[#dvsuper|Super Charge your data warehouse]], page 76</ref> so that very large implementations can scale out without the need for major redesign.
Unlike the [[star schema]] ([[dimensional modelling]]) and the classical [[relational model]] (3NF), data vault and [[anchor
== History and philosophy ==
{{Original research
In its early days, Dan Linstedt referred to the modeling technique which was to become data vault as ''common foundational warehouse architecture''<ref>Building a scalable datawarehouse with data vault 2.0, p. 11</ref> or ''common foundational modeling architecture''.<ref>Building a scalable datawarehouse with data vault 2.0, p. xv</ref> In [[data warehouse]] modeling there are two well-known competing options for modeling the layer where the data are stored. Either you model according to [[Ralph Kimball]], with conformed dimensions and an [[Enterprise bus matrix|enterprise data bus]], or you model according to [[Bill Inmon]] with the database [[normal forms|normalized]].<ref>{{
Dan Linstedt, the creator of the method, describes the resulting database as follows:
Line 22 ⟶ 23:
Another issue to which data vault is a response is that more and more there is a need for complete auditability and traceability of all the data in the data warehouse. Due to [[Sarbanes-Oxley]] requirements in the USA and similar measures in Europe this is a relevant topic for many business intelligence implementations, hence the focus of any data vault implementation is complete traceability and auditability of all information.
''Data Vault 2.0'' is the new specification. It is an [[open standard]].<ref>[[#dvos2|A short intro to#datavault 2.0]]</ref> The new specification consists of three pillars: methodology ([[Software Engineering
It is necessary to evolve the specification to include the new components, along with the best practices in order to keep the EDW and BI systems current with the needs and desires of today's businesses.
=== History ===
Data vault modeling was originally conceived by Dan Linstedt in the 1990s and was released in 2000 as a public ___domain modeling method. In a series of five articles in The Data Administration Newsletter the basic rules of the Data Vault method are expanded and explained. These contain a general overview,<ref>
An alternative (and seldom used) name for the method is "Common Foundational Integration Modelling Architecture."<ref>[[#dwdummy|Data Warehousing for Dummies]], page 83</ref>
Line 41 ⟶ 42:
== Basic notions ==
Data Vault 2.0 organizes data into three core components that separate stable identifiers from changing descriptive attributes:<ref name="linstedt2015">{{cite book |last1=Linstedt |first1=Daniel |last2=Olschimke |first2=Michael |title=Building a Scalable Data Warehouse with Data Vault 2.0 |publisher=Morgan Kaufmann |year=2015 |isbn=9780128025109 |url=https://books.google.com/books/about/Building_a_Scalable_Data_Warehouse_with.html?id=lgDJBAAAQBAJ}}</ref>
* '''Hub''' – stores a unique business key for a core business concept together with minimal metadata for lineage/audit; it acts as an integration point across sources.<ref name="linstedt2015" />
* '''Link''' – captures the relationship (often many-to-many) between hubs; the participating hub keys define the grain of the relationship.<ref name="linstedt2015" />
* '''Satellite''' – contains descriptive attributes and their history associated with a hub or link; satellites are append-only so every change is preserved (similar in effect to Type-II history in dimensional models).<ref name="linstedt2015" />
Specialized satellites support temporal semantics. For example, an '''effectivity satellite''' on a link records begin/end dates representing when the relationship is considered effective by the business.<ref name="linstedt2015" />
=== Layers ===
*'''Raw Vault''' – a source-driven integration layer that retains granular, auditable history with minimal transformations.<ref name="linstedt2015" />
*'''Business Vault''' – a derived layer that applies business rules and query-assistance structures (e.g., PIT and bridge tables) to facilitate downstream consumption.<ref name="linstedt2015" />
=== Use with dimensional models ===
In practice, Data Vault commonly serves as the historical integration layer, while star-schema information marts are projected from the Raw/Business Vault for performant analytics and simpler user access.<ref name="hultgren">{{cite book |last=Hultgren |first=Hans |title=Modeling the Agile Data Warehouse with Data Vault |year=2012 |publisher=Brighton Hamilton |isbn=9780615723082 |url=https://searchworks.stanford.edu/view/11487017}}</ref><ref name="kimball2013">{{cite web |title=The Data Warehouse Toolkit, 3rd Edition |website=Wiley |url=https://www.wiley.com/en-us/The%2BData%2BWarehous}}</ref>
=== Hubs ===
Line 75 ⟶ 87:
=== Links ===
Associations or transactions between business keys (relating for instance the hubs for customer and product with each other through the purchase transaction) are modeled using link tables. These tables are basically [[Many-to-many (data model)|many-to-many join]] tables, with some metadata.
Links can link to other links, to deal with changes in granularity (for instance, adding a new key to a database table would change the grain of the database table). For instance, if you have an association between customer and address, you could add a reference to a link between the hubs for product and transport company. This could be a link called "Delivery". Referencing a link in another link is considered a bad practice, because it introduces dependencies between links that make parallel loading more difficult. Since a link to another link is the same as a new link with the hubs from the other link, in these cases creating the links without referencing other links is the preferred solution (see the section on loading practices for more information).
Line 107 ⟶ 119:
All the tables contain metadata, minimally describing at least the source system and the date on which this entry became valid, giving a complete historical view of the data as it enters the data warehouse.
An effectivity satellite is a satellite built on a link, "and record[s] the time period when the corresponding link records start and end effectivity".<ref>[https://dbtvault.readthedocs.io/en/latest/tutorial/tut_eff_satellites/ Effectivity Satellites - dbtvault]</ref>
==== Satellite example ====
Line 120 ⟶ 134:
|-
|-
| S_SEQ_NR || Ordering or [[sequence number]], to enforce uniqueness if there are several valid satellites for one parent key || No(**) || This can happen if, for instance, you have a hub COURSE and the name of the course is an attribute but in several different languages.
|-
| S_LDTS || Load Date (startdate) for the validity of this combination of attribute values for parent key L_DRIVER_ID || Yes
Line 178 ⟶ 192:
For this purpose, the hubs and related satellites on those hubs can be considered as dimensions and the links and related satellites on those links can be viewed as fact tables in a dimensional model. This enables you to quickly prototype a dimensional model out of a data vault model using views.
Note that while it is relatively straightforward to move data from a data vault model to a (cleansed) dimensional model, the reverse is not as easy, given the denormalized nature of the dimensional model's fact tables, fundamentally different to the [[third normal form]] of the data vault.<ref>[https://melbournevault.com.au/ Melbournevault], 16 May 2023</ref>
==
The data vault methodology is based on [[Software Engineering Institute|SEI]]/[[CMMI]] Level 5 best practices. It includes multiple components of CMMI Level 5, and combines them with best practices from [[Six Sigma]], [[
Teams using the data vault methodology should readily adapt to the repeatable, consistent, and measurable projects that are expected at CMMI Level 5. Data that flow through the EDW data vault system will begin to follow the TQM
== Tools ==
Some examples of tools are:{{clarify|date=April 2022}}
* [https://www.datavault4dbt.com DataVault4dbt]
* [https://datavault-builder.com 2150 Datavault Builder]
* [https://www.astera.com/products/data-warehouse-builder/ Astera DW Builder]
* [https://wherescape.com Wherescape]
* [https://vaultspeed.com Vaultspeed]
* [https://
== See
* {{annotated link|Bill Inmon}}
* {{annotated link|Data lake}}
*
* {{annotated link|The Kimball lifecycle}}, developed by {{annotated link|Ralph Kimball}}
* {{annotated link|Staging area}}
* {{annotated link|Agile Business Intelligence}}
== References ==
Line 209 ⟶ 227:
* {{cite book |ref = dwdummy |author1=Thomas C. Hammergren |author2=Alan R. Simon |date=February 2009 |title = Data Warehousing for Dummies, 2nd edition |publisher=John Wiley & Sons |isbn = 978-0-470-40747-9
}}
* {{cite journal |ref = rdamhof1 |author1 = Ronald Damhof |author2 = Lidwine van As |date = August 25, 2008 |title = The next generation EDW – Letting go of the idea of a single version of the truth |journal = Database Magazine (DB/M) |publisher = Array Publications B.V. |url = http://prudenza.typepad.com/files/damhof_dbm0508_eng-1.pdf }}
* {{cite web |ref =
* {{cite web |ref =
* {{cite web |ref =
* {{cite web |ref =
* {{cite web |ref = tdan5 |last = Linstedt |first = Dan |title = Data Vault Series 5 – Loading Practices |url = http://www.tdan.com/view-articles/5285/ |work = Data Vault Series |publisher = The Data Administration Newsletter |access-date = 12 September 2011 }}
* {{cite web |ref =
* {{cite web |ref =
* {{cite web |ref =
* {{cite web |ref = dvos2 |last = Linstedt |first = Dan |title = A short intro to #datavault 2.0 |url = http://danlinstedt.com/datavaultcat/a-short-intro-to-datavault-2-0/ |website = DanLinstedt.com |publisher = Dan Linstedt |access-date = 2014-01-03 |archive-date = 2014-01-03 |archive-url = https://web.archive.org/web/20140103055015/http://danlinstedt.com/datavaultcat/a-short-intro-to-datavault-2-0/ |url-status = dead }}▼
* {{cite web |ref =
▲* {{cite web |ref = dvrules1 |last = Linstedt |first = Dan |title = Data Vault Modeling Specification v1.0.9 |url = http://danlinstedt.com/datavaultcat/standards/dv-modeling-specification-v1-0-8/ |work = Data Vault Forum |publisher = Dan Linstedt |access-date = 26 September 2012
▲* {{cite web |ref = dvloadspec1 |last = Linstedt |first = Dan |title = Data Vault Loading Specification v1.2 |url = http://danlinstedt.com/datavaultcat/standards/data-vault-loading-specification-v1-2/ |website = DanLinstedt.com |publisher = Dan Linstedt |access-date=2014-01-03
▲* {{cite web |ref = dvos2 |last = Linstedt |first = Dan |title = A short intro to #datavault 2.0 |url = http://danlinstedt.com/datavaultcat/a-short-intro-to-datavault-2-0/ |website = DanLinstedt.com |publisher = Dan Linstedt |access-date=2014-01-03
▲* {{cite web |ref = dvspec2 |last = Linstedt |first = Dan |title = Data Vault 2.0 Being Announced |url = http://danlinstedt.com/datavaultcat/data-vault-2-0-being-announced/ |website = DanLinstedt.com |publisher = Dan Linstedt |access-date = 2014-01-03
; Dutch language sources
Line 264 ⟶ 272:
==External links==
* [http://www.danlinstedt.com The homepage of Dan Linstedt, the inventor of Data Vault modeling]
{{Data warehouse}}
|