{{Short description|Database modeling method}}
{{Refimprove|date=November 2016}}
[[File:Data Vault Example.png|thumb|upright=1.5|Simple data vault model with two hubs (blue), one link (green) and four satellites (yellow)]]
 
'''Data vault''' or '''data vault modeling''' is a [[database]] modeling method that is designed to provide long-term historical storage of [[data]] coming in from multiple operational systems. It is also a method of looking at historical data that deals with issues such as auditing, tracing of data, loading speed, and [[Resilience (organizational)|resilience]] to change, as well as emphasizing the need to [[Audit trail|trace]] where all the data in the database [[Data lineage|came from]]. This means that every [[Row (database)|row]] in a data vault must be accompanied by record source and load date attributes, enabling an auditor to trace values back to the source. The concept was published in 2000 by [[Dan Linstedt]].
 
Data vault modeling makes no distinction between good and bad data ("bad" meaning not conforming to business rules).<ref>[[#dvsuper|Super Charge your data warehouse]], page 74</ref> This is summarized in the statement that a data vault stores "[[Single source of truth|a single version of the facts]]" (also expressed by [[Dan Linstedt]] as "all the data, all of the time") as opposed to the practice in other data warehouse methods of storing "a [[single version of the truth]]"<ref>[[#rdamhof1|The next generation EDW]]</ref> where data that does not conform to the definitions is removed or "cleansed". A data vault enterprise data warehouse provides both: a single version of the facts and a single source of truth.<ref>Building a Scalable Data Warehouse with Data Vault 2.0, p. 6</ref>
 
The modeling method is designed to be resilient to change in the business environment where the data being stored is coming from, by explicitly separating [[Data structure|structural information]] from descriptive [[Attribute (computing)|attributes]].<ref>[[#dvsuper|Super Charge your data warehouse]], page 21</ref> Data vault is designed to enable [[Parallel computing|parallel]] loading as much as possible,<ref>[[#dvsuper|Super Charge your data warehouse]], page 76</ref> so that very large implementations can scale out without the need for major redesign.
 
Unlike the [[star schema]] ([[dimensional modelling]]) and the classical [[relational model]] (3NF), data vault and [[anchor modeling]] are well-suited for capturing changes that occur when a source system is changed or added, but are considered advanced techniques which require experienced [[data architect]]s.<ref>{{cite web|access-date=2023-02-22|first=Johan|language=sv|surname=Porsby|title=Rålager istället för ett strukturerat datalager|url=https://www.agero.se/blogg/ralager-istallet-for-ett-strukturerat-datalager|work=www.agero.se}}</ref> Both data vaults and anchor models are [[Entity (computer science)|entity-based]] models,<ref>{{cite web|access-date=2023-02-22|first=Johan|language=sv|surname=Porsby|title=Datamodeller för data warehouse|url=https://www.agero.se/blogg/datamodeller-for-data-warehouse|work=www.agero.se}}</ref> but anchor models have a more normalized approach.{{citation needed|date=March 2023}}
 
== History and philosophy ==
{{Original research|discuss=Template talk:Original research#discuss parameter|date=August 2019}}
 
In its early days, Dan Linstedt referred to the modeling technique which was to become data vault as ''common foundational warehouse architecture''<ref>Building a scalable datawarehouse with data vault 2.0, p. 11</ref> or ''common foundational modeling architecture''.<ref>Building a scalable datawarehouse with data vault 2.0, p. xv</ref> In [[data warehouse]] modeling there are two well-known competing options for modeling the layer where the data are stored. Either you model according to [[Ralph Kimball]], with conformed dimensions and an [[Enterprise bus matrix|enterprise data bus]], or you model according to [[Bill Inmon]] with the database [[normal forms|normalized]].<ref>{{Cite web |date=2020-02-03 |title=Data Warehouse Concepts: Kimball vs. Inmon Approach |url=https://www.astera.com/type/blog/data-warehouse-concepts/ |access-date=2024-10-02 |website=Astera |language=en-US}}</ref> Both techniques have issues when dealing with changes in the systems feeding the data warehouse{{Citation needed|date=August 2019}}. For conformed dimensions you also have to cleanse data (to conform it), and this is undesirable in a number of cases since it inevitably loses information{{Citation needed|date=August 2019}}. Data vault is designed to avoid or minimize the impact of those issues, by moving them to areas of the data warehouse that are outside the historical storage area (cleansing is done in the data marts) and by separating the structural items (business keys and the associations between the business keys) from the descriptive attributes.
 
Dan Linstedt, the creator of the method, describes the resulting database as follows:
Another issue to which data vault is a response is that more and more there is a need for complete auditability and traceability of all the data in the data warehouse. Due to [[Sarbanes-Oxley]] requirements in the USA and similar measures in Europe this is a relevant topic for many business intelligence implementations, hence the focus of any data vault implementation is complete traceability and auditability of all information.
 
''Data Vault 2.0'' is the new specification. It is an [[open standard]].<ref>[[#dvos2|A short intro to #datavault 2.0]]</ref> The new specification consists of three pillars: the methodology ([[Software Engineering Institute|SEI]]/[[Capability Maturity Model|CMMI]], [[Six Sigma]], [[Systems development life cycle|SDLC]], etc.), the architecture (among others an input layer (data stage, called [[persistent staging area]] in Data Vault 2.0), a presentation layer (data mart), and handling of data quality services and master data services), and the model. Within the methodology, the implementation of best practices is defined. Data Vault 2.0 has a focus on including new components such as [[big data]] and [[NoSQL]], and also focuses on the performance of the existing model. The old specification (documented here for the most part) is highly focused on data vault modeling. It is documented in the book ''Building a Scalable Data Warehouse with Data Vault 2.0''.<ref>{{Cite web |title=Building a Scalable Data Warehouse with Data Vault 2.0 [Book] |url=https://www.oreilly.com/library/view/building-a-scalable/9780128026489/ |access-date=2024-10-02 |website=www.oreilly.com |language=en}}</ref>
 
The specification must evolve to include these new components and best practices in order to keep the EDW and BI systems current with the needs and desires of today's businesses.
 
=== History ===
Data vault modeling was originally conceived by Dan Linstedt in the 1990s and was released in 2000 as a public ___domain modeling method. In a series of five articles in The Data Administration Newsletter the basic rules of the Data Vault method are expanded and explained. These contain a general overview,<ref>{{Cite web |date=2002-07-01 |title=Data Vault Series 1 – Data Vault Overview |url=https://tdan.com/data-vault-series-1-data-vault-overview/5054 |access-date=2024-10-02 |website=TDAN.com |language=en-US}}</ref> an overview of the components,<ref>{{Cite web |date=2003-01-01 |title=Data Vault Series 2 – Data Vault Components |url=https://tdan.com/data-vault-series-2-data-vault-components/5155 |access-date=2024-10-02 |website=TDAN.com |language=en-US}}</ref> a discussion about end dates and joins,<ref>[[#tdan3|Data Vault Series 3 – End Dates and Basic Joins]]</ref> link tables,<ref>[[#tdan4|Data Vault Series 4 – Link tables]], paragraph 2.3</ref> and an article on loading practices.<ref name="DataVault_a">[[#tdan5|Data Vault Series 5 – Loading Practices]]</ref>
 
An alternative (and seldom used) name for the method is "Common Foundational Integration Modelling Architecture."<ref>[[#dwdummy|Data Warehousing for Dummies]], page 83</ref>
 
== Basic notions ==
Data Vault 2.0 organizes data into three core components that separate stable identifiers from changing descriptive attributes:<ref name="linstedt2015">{{cite book |last1=Linstedt |first1=Daniel |last2=Olschimke |first2=Michael |title=Building a Scalable Data Warehouse with Data Vault 2.0 |publisher=Morgan Kaufmann |year=2015 |isbn=9780128025109 |url=https://books.google.com/books/about/Building_a_Scalable_Data_Warehouse_with.html?id=lgDJBAAAQBAJ}}</ref>
Data vault attempts to solve the problem of dealing with change in the environment by separating the business keys (that do not mutate as often, because they uniquely identify a business entity) and the associations between those business keys, from the descriptive attributes of those keys.
 
* '''Hub''' – stores a unique business key for a core business concept together with minimal metadata for lineage/audit; it acts as an integration point across sources.<ref name="linstedt2015" />
* '''Link''' – captures the relationship (often many-to-many) between hubs; the participating hub keys define the grain of the relationship.<ref name="linstedt2015" />
* '''Satellite''' – contains descriptive attributes and their history associated with a hub or link; satellites are append-only so every change is preserved (similar in effect to Type-II history in dimensional models).<ref name="linstedt2015" />

The business keys and their associations are structural attributes, forming the skeleton of the data model. The data vault method has as one of its main axioms that real business keys only change when the business changes and are therefore the most stable elements from which to derive the structure of a historical database. If you use these keys as the backbone of a data warehouse, you can organize the rest of the data around them. This means that choosing the correct keys for the hubs is of prime importance for the stability of your model.<ref>[[#dvsuper|Super Charge your data warehouse]], page 61, why are business keys important</ref> The keys are stored in tables with a few constraints on the structure. These key-tables are called hubs.
 
Specialized satellites support temporal semantics. For example, an '''effectivity satellite''' on a link records begin/end dates representing when the relationship is considered effective by the business.<ref name="linstedt2015" />
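The append-only behavior of satellites can be illustrated with a short Python sketch (not part of any data vault specification; all table and column names here are invented for illustration). A hub holds only the business key plus load metadata, a link relates hub keys, and a satellite accumulates attribute history without updates in place:

```python
from datetime import date

# Hypothetical in-memory tables illustrating the three component types.
hub_customer = [
    # business key + load metadata only; no descriptive attributes
    {"customer_bk": "C-1001", "load_dts": date(2024, 1, 1), "record_source": "CRM"},
]

link_purchase = [
    # relationship between hubs: the pair of hub keys defines the grain
    {"customer_bk": "C-1001", "product_bk": "P-77",
     "load_dts": date(2024, 1, 2), "record_source": "SALES"},
]

sat_customer = [
    # descriptive attributes; append-only, so history is preserved
    {"customer_bk": "C-1001", "name": "Acme Ltd",
     "load_dts": date(2024, 1, 1), "record_source": "CRM"},
]

def append_satellite_row(sat, row):
    """Satellites are never updated in place: every change is appended."""
    sat.append(row)

# The customer is renamed; the old row stays, giving Type-II-style history.
append_satellite_row(sat_customer, {
    "customer_bk": "C-1001", "name": "Acme Limited",
    "load_dts": date(2024, 3, 1), "record_source": "CRM",
})

history = [r["name"] for r in sat_customer if r["customer_bk"] == "C-1001"]
print(history)  # both versions survive
```

Because the hub carries no descriptive attributes, a change in the customer's name touches only the satellite; the structural skeleton (hub and link) is unaffected.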
 
=== Layers ===
*'''Raw Vault''' – a source-driven integration layer that retains granular, auditable history with minimal transformations.<ref name="linstedt2015" />
*'''Business Vault''' – a derived layer that applies business rules and query-assistance structures (e.g., PIT and bridge tables) to facilitate downstream consumption.<ref name="linstedt2015" />
 
=== Use with dimensional models ===
In practice, Data Vault commonly serves as the historical integration layer, while star-schema information marts are projected from the Raw/Business Vault for performant analytics and simpler user access.<ref name="hultgren">{{cite book |last=Hultgren |first=Hans |title=Modeling the Agile Data Warehouse with Data Vault |year=2012 |publisher=Brighton Hamilton |isbn=9780615723082 |url=https://searchworks.stanford.edu/view/11487017}}</ref><ref name="kimball2013">{{cite web |title=The Data Warehouse Toolkit, 3rd Edition |website=Wiley |url=https://www.wiley.com/en-us/The%2BData%2BWarehous}}</ref>
 
=== Hubs ===
 
=== Links ===
Associations or transactions between business keys (relating for instance the hubs for customer and product with each other through the purchase transaction) are modeled using link tables. These tables are basically [[Many-to-many (data model)|many-to-many join]] tables, with some metadata.
 
Links can link to other links, to deal with changes in granularity (for instance, adding a new key to a database table would change the grain of that table). For instance, if you have an association between customer and address, you could add a reference to a link between the hubs for product and transport company; this could be a link called "Delivery". Referencing a link in another link is considered a bad practice, because it introduces dependencies between links that make parallel loading more difficult. Since a link to another link is the same as a new link with the hubs from the other link, in these cases creating the links without referencing other links is the preferred solution (see the section on loading practices for more information).
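The preferred refactoring described above can be sketched in Python (an illustrative sketch only; the key names are invented): a link that references another link is replaced by an independent link carrying that parent link's hub keys directly, removing the load-order dependency:

```python
# A link-to-link design (discouraged): "delivery" references the purchase link.
link_purchase = {"link_id": "L1", "customer_bk": "C-1001", "product_bk": "P-77"}
link_delivery_bad = {"purchase_link_id": "L1", "carrier_bk": "T-9"}

def flatten_link(child_link, parent_link, parent_ref):
    """Replace the reference to a parent link with the parent's hub keys,
    yielding an independent link that can be loaded in parallel."""
    hub_keys = {k: v for k, v in parent_link.items() if k.endswith("_bk")}
    flat = {k: v for k, v in child_link.items() if k != parent_ref}
    flat.update(hub_keys)
    return flat

link_delivery = flatten_link(link_delivery_bad, link_purchase, "purchase_link_id")
print(sorted(link_delivery))  # only hub keys remain, no link-to-link dependency
```

The flattened "Delivery" link now relates the customer, product, and transport-company hubs directly, so it no longer has to wait for the purchase link to load.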
 
All the tables contain metadata, at minimum the source system and the date on which this entry became valid, giving a complete historical view of the data as it enters the data warehouse.
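Attaching this minimal audit metadata at load time might look like the following Python sketch (an illustration only; the function and column names are invented, not part of the data vault standard):

```python
from datetime import datetime, timezone

def stamp(row, record_source):
    """Add the minimal audit metadata every data vault row carries:
    the source system and the moment the row entered the warehouse."""
    stamped = dict(row)
    stamped["record_source"] = record_source
    stamped["load_dts"] = datetime.now(timezone.utc)
    return stamped

# Every incoming row is stamped before it is written to a hub, link, or satellite.
row = stamp({"customer_bk": "C-1001", "name": "Acme Ltd"}, record_source="CRM")
```

With these two columns on every row, an auditor can later answer "where did this value come from, and when did we learn it?" without consulting the source systems.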
 
An effectivity satellite is a satellite built on a link, "and record[s] the time period when the corresponding link records start and end effectivity".<ref>[https://dbtvault.readthedocs.io/en/latest/tutorial/tut_eff_satellites/ Effectivity Satellites - dbtvault]</ref>
 
==== Satellite example ====
|-
| S_SEQ_NR || Ordering or [[sequence number]], to enforce uniqueness if there are several valid satellites for one parent key || No(**) || This can happen if, for instance, you have a hub COURSE and the name of the course is an attribute but in several different languages.
|-
| S_LDTS || Load date (start date) for the validity of this combination of attribute values for parent key L_DRIVER_ID || Yes
For this purpose, the hubs and related satellites on those hubs can be considered as dimensions and the links and related satellites on those links can be viewed as fact tables in a dimensional model. This enables you to quickly prototype a dimensional model out of a data vault model using views.
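This projection can be sketched in Python (illustrative only; real implementations would use database views, and the column names here are hypothetical): the most recent satellite row per business key is joined to the hub, yielding a dimension-like structure:

```python
def latest_satellite_rows(sat):
    """Pick the most recent satellite row per business key."""
    latest = {}
    for row in sorted(sat, key=lambda r: r["load_dts"]):
        latest[row["customer_bk"]] = row  # later loads overwrite earlier ones
    return latest

def project_dimension(hub, sat):
    """Join hub keys to their current descriptive attributes,
    like a view deriving a dimension from the data vault."""
    current = latest_satellite_rows(sat)
    return [
        {"customer_bk": h["customer_bk"],
         **{k: v for k, v in current.get(h["customer_bk"], {}).items()
            if k not in ("customer_bk", "load_dts")}}
        for h in hub
    ]

hub = [{"customer_bk": "C-1001"}]
sat = [
    {"customer_bk": "C-1001", "name": "Acme Ltd", "load_dts": 1},
    {"customer_bk": "C-1001", "name": "Acme Limited", "load_dts": 2},
]
dim_customer = project_dimension(hub, sat)
print(dim_customer)  # [{'customer_bk': 'C-1001', 'name': 'Acme Limited'}]
```

A fact-table projection would analogously join a link to the latest rows of its satellites, with the link's hub keys acting as the dimension keys.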
 
Note that while it is relatively straightforward to move data from a data vault model to a (cleansed) dimensional model, the reverse is not as easy, given the denormalized nature of the dimensional model's fact tables, fundamentally different from the [[third normal form]] of the data vault.<ref>[https://melbournevault.com.au/ Melbournevault], 16 May 2023</ref>
 
== Data vault methodology ==
 
The data vault methodology is based on [[Software Engineering Institute|SEI]]/[[CMMI]] Level 5 best practices. It includes multiple components of CMMI Level 5 and combines them with best practices from [[Six Sigma]], [[total quality management]] (TQM), and SDLC. In particular, it is focused on Scott Ambler's [[agile methodology]] for build-out and deployment. Data vault projects have a short, scope-controlled release cycle and should consist of a production release every two to three weeks.
 
Teams using the data vault methodology should readily adapt to the repeatable, consistent, and measurable projects that are expected at CMMI Level 5. Data that flows through the EDW data vault system will begin to follow the TQM life cycle that has long been missing from BI (business intelligence) projects.
 
== Tools ==
Some examples of tools are:{{clarify|date=April 2022}}
* [https://www.datavault4dbt.com DataVault4dbt]
* [https://datavault-builder.com 2150 Datavault Builder]
* [https://www.astera.com/products/data-warehouse-builder/ Astera DW Builder]
* [https://wherescape.com Wherescape]
* [https://vaultspeed.com Vaultspeed]
* [https://automate-dv.com AutomateDV]
 
== See also ==
* {{annotated link|Bill Inmon}}
* {{annotated link|Data lake}}
* {{annotated link|Data warehouse}}
* {{annotated link|The Kimball lifecycle}}, developed by {{annotated link|Ralph Kimball}}
* {{annotated link|Staging area}}
* {{annotated link|Agile Business Intelligence}}
 
== References ==
* {{cite book |ref = dwdummy |author1=Thomas C. Hammergren |author2=Alan R. Simon |date=February 2009 |title = Data Warehousing for Dummies, 2nd edition |publisher=John Wiley & Sons |isbn = 978-0-470-40747-9 }}
* {{cite journal |ref = rdamhof1 |author1 = Ronald Damhof |author2 = Lidwine van As |date = August 25, 2008 |title = The next generation EDW – Letting go of the idea of a single version of the truth |journal = Database Magazine (DB/M) |publisher = Array Publications B.V. |url = http://prudenza.typepad.com/files/damhof_dbm0508_eng-1.pdf }}
* {{cite web |ref = tdan1 |last = Linstedt |first = Dan |title = Data Vault Series 1 – Data Vault Overview |url = http://www.tdan.com/view-articles/5054/ |work = Data Vault Series |publisher = The Data Administration Newsletter |access-date = 12 September 2011 }}
* {{cite web |ref = tdan2 |last = Linstedt |first = Dan |title = Data Vault Series 2 – Data Vault Components |url = http://www.tdan.com/view-articles/5155/ |work = Data Vault Series |publisher = The Data Administration Newsletter |access-date = 12 September 2011 }}
* {{cite web |ref = tdan3 |last = Linstedt |first = Dan |title = Data Vault Series 3 – End Dates and Basic Joins |url = http://www.tdan.com/view-articles/5067/ |work = Data Vault Series |publisher = The Data Administration Newsletter |access-date = 12 September 2011 }}
* {{cite web |ref = tdan4 |last = Linstedt |first = Dan |title = Data Vault Series 4 – Link Tables |url = http://www.tdan.com/view-articles/5172/ |work = Data Vault Series |publisher = The Data Administration Newsletter |access-date = 12 September 2011 }}
* {{cite web |ref = tdan5 |last = Linstedt |first = Dan |title = Data Vault Series 5 – Loading Practices |url = http://www.tdan.com/view-articles/5285/ |work = Data Vault Series |publisher = The Data Administration Newsletter |access-date = 12 September 2011 }}
* {{cite web |ref = dvforum1b |last = Kunenborg |first = Ronald |title = Data Vault Rules v1.0.8 Cheat Sheet |url = http://www.grundsatzlich-it.nl/wp-content/uploads/pub1-dv_datasheet_v108_a3.pdf |work = Data Vault Rules |publisher = Grundsätzlich IT |access-date = 26 September 2012 }} Cheat sheet reflecting the rules in v1.0.8 and additional clarification from the forums on the rules in v1.0.8.
* {{cite web |ref = dvrules1 |last = Linstedt |first = Dan |title = Data Vault Modeling Specification v1.0.9 |url = http://danlinstedt.com/datavaultcat/standards/dv-modeling-specification-v1-0-8/ |work = Data Vault Forum |publisher = Dan Linstedt |access-date = 26 September 2012 |archive-date = 30 November 2012 |archive-url = https://web.archive.org/web/20121130131024/http://danlinstedt.com/datavaultcat/standards/dv-modeling-specification-v1-0-8/ |url-status = dead }}
* {{cite web |ref = dvloadspec1 |last = Linstedt |first = Dan |title = Data Vault Loading Specification v1.2 |url = http://danlinstedt.com/datavaultcat/standards/data-vault-loading-specification-v1-2/ |website = DanLinstedt.com |publisher = Dan Linstedt |access-date = 2014-01-03 |archive-date = 2014-01-03 |archive-url = https://web.archive.org/web/20140103052653/http://danlinstedt.com/datavaultcat/standards/data-vault-loading-specification-v1-2/ |url-status = dead }}
* {{cite web |ref = dvos2 |last = Linstedt |first = Dan |title = A short intro to #datavault 2.0 |url = http://danlinstedt.com/datavaultcat/a-short-intro-to-datavault-2-0/ |website = DanLinstedt.com |publisher = Dan Linstedt |access-date = 2014-01-03 |archive-date = 2014-01-03 |archive-url = https://web.archive.org/web/20140103055015/http://danlinstedt.com/datavaultcat/a-short-intro-to-datavault-2-0/ |url-status = dead }}
* {{cite web |ref = dvspec2 |last = Linstedt |first = Dan |title = Data Vault 2.0 Being Announced |url = http://danlinstedt.com/datavaultcat/data-vault-2-0-being-announced/ |website = DanLinstedt.com |publisher = Dan Linstedt |access-date = 2014-01-03 |archive-date = 2012-08-21 |archive-url = https://web.archive.org/web/20120821012328/http://danlinstedt.com/datavaultcat/data-vault-2-0-being-announced/ |url-status = dead }}
 
; Dutch language sources
 
==External links==
* [http://www.datavaultalliance.com The home for the Data Vault community users]
* [https://www.datavaultalliance.com/certification The path to Certification]
* [http://www.danlinstedt.com The homepage of Dan Linstedt, the inventor of Data Vault modeling]
* [http://www.learndatavault.com A website dedicated to Data Vault, maintained by Dan Linstedt]
* [https://www.youtube.com/LearnDataVault YouTube videos on the Data Vault modeling approach and methodology]
* [http://www.slideshare.net/dlinstedt Dan Linstedt's Slide Share Site]
* [http://www.dataVaultCertification.com Data Vault Certification Site]
* [http://www.AgileData.org Agile Data Site]
* [http://www.DisciplinedAgileDelivery.com Disciplined Agile Delivery (DAD) Site]
 
{{Data warehouse}}