The Data Vault Methodology
Data Warehouses have served as a source of value-added information for decades. In supporting business users in their day-to-day operations, Data Warehouses integrate the multiple sources used throughout the organisation, track history for knowledge improvement, auditing and compliance, and provide the business with an easy-to-access and easy-to-explore insight layer. While Data Warehouses are often modelled following the Kimball or Inmon approach, Data Vault is a viable alternative. Data Vault is renowned for its stereotypical modelling technique involving Hubs, Links and Satellites. Beyond modelling, it also provides a project approach, architectures, and implementation guidelines.
Data Vault modelling
In a Data Vault oriented Data Warehouse, the ‘Data Vault layer’ is the beating heart of your information. It houses all data that your organisation has ever recorded and uniformly models it in three components: Hubs, Links and Satellites. The goal of the Data Vault layer is to track every attribute change in your source systems and provide a consistent ‘base’ on which subsequent objects are built (e.g. a Kimball-oriented layer, data/information marts, OLAP cubes). The three types of tables supporting the Data Vault layer serve the following purposes:
- Hubs uniquely identify records (customers, products, …) across systems throughout (and even outside of) the organisation. Only the unique business key (or a hash of it) is stored in this object, along with some technical logging columns.
- Links model relationships between hub objects. These could be associations, hierarchies, redefinitions of business terms, and so on. Every relationship is modelled as a many-to-many to improve flexibility and support changing business processes.
- Satellites track all history related to the unique record in a hub or the properties of a link. By using a start and end date, the validity of each historical observation is registered in this object. As they contain all properties related to the hub or link, these objects are often very wide (= lots of columns).
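To make the Hub concept concrete, the sketch below shows one common way the (hash of the) unique business key can be derived: normalise the business key parts, concatenate them with a delimiter, and hash the result. The function name, the delimiter and the choice of MD5 are illustrative assumptions, not a prescribed standard.

```python
import hashlib


def hub_hash_key(*business_key_parts: str) -> str:
    """Build a deterministic hash key from one or more business key parts.

    Illustrative sketch: trims and upper-cases each part, joins them with a
    delimiter, and hashes the result with MD5 (a common Data Vault convention).
    """
    normalised = "||".join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


# The same business key always yields the same hash, regardless of casing
# or stray whitespace in the source system:
print(hub_hash_key("cust-001") == hub_hash_key(" CUST-001 "))  # True
```

Because the hash is computed from the business key alone, two source systems that know the same customer will independently produce the same hub key, which is what makes integration across applications possible.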
A minimalistic approach to Data Vault would include the following layers:
- A Staging area is used as a preliminary step to minimise the burden of extracting data out of source systems and to support feeding the Data Vault layer (already implementing some constraints). Staging only serves as a pass-through; no history is captured here.
- The Data Vault layer, modelled using Hubs, Links and Satellites. Its sole purpose is tracking changes and providing a uniformly modelled access layer. Contrary to what you might expect from a layer that houses your ‘single source of truth’, business logic (such as business interpretations and additional calculations) is not housed in this layer.
- An information layer that is presented to business users. This can include (Kimball) Star Schemas, data/information marts and OLAP cubes, integrating business logic.
As can be noted, the Staging layer remains and conforms to other modelling techniques. In all methodologies, a Data Warehouse wants to minimise its impact on operational systems and thus makes a one-to-one copy of the transactional data into this layer. Contrary to some methodologies, Staging does not maintain any historical data, as this is done in the Data Vault layer. Technical constraints are still enforced in this layer, which may be a source of interruption (e.g. an underestimated column size, a wrongly expected data type, assumed referential integrity, …).
The Data Vault layer then wrangles the data that has landed in Staging into Hubs, Links and Satellites. This includes denormalisation (= joining together data that is related to the same business key into a satellite), normalisation (= storing information in satellites referencing a hub) and sanity tests on historical existence (= checking whether the same relationships still apply, or whether one has been deleted). The conversion to a Data Vault layer should not be taken lightly and requires considerable upfront analysis to define the correct Business Keys! Due to its standardised hub-and-spoke architecture, large parts of loading this layer can be automated, and a Data Vault layer could even be generated if your source system applies primary and foreign keys. Due to its nature, transformation to a Data Vault layer benefits from, but also requires, a high degree of parallelisation. Distributed systems (such as Spark/Databricks, or Massively Parallel Processing engines such as Synapse) can aid in this parallelisation. In contrast to traditionally modelled Data Warehouses, this layer is not directly accessed by end users (mostly due to the high complexity of writing queries that span multiple Hubs, Links and Satellites, which are modelled in a many-to-many way).
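A core part of loading this layer is only appending a satellite row when something actually changed. The sketch below, using hypothetical in-memory tables rather than a real database, illustrates the common ‘hashdiff’ technique: hash the descriptive attributes of the staged record and compare it against the most recent satellite row for the same hub key.

```python
import hashlib
from datetime import date


def hashdiff(record: dict, attributes: list[str]) -> str:
    """Hash the descriptive attributes of a record to detect changes cheaply."""
    payload = "||".join(str(record.get(a, "")).strip().upper() for a in attributes)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def load_satellite(satellite: list[dict], staged: dict,
                   attributes: list[str], load_date: date) -> None:
    """Append a new satellite row only when the staged record differs from
    the most recent row for the same hub hash key (illustrative sketch)."""
    new_diff = hashdiff(staged, attributes)
    same_key = [r for r in satellite if r["hub_hash_key"] == staged["hub_hash_key"]]
    latest = max(same_key, key=lambda r: r["load_date"], default=None)
    if latest is None or latest["hashdiff"] != new_diff:
        satellite.append({
            "hub_hash_key": staged["hub_hash_key"],
            "load_date": load_date,
            "hashdiff": new_diff,
            **{a: staged[a] for a in attributes},
        })


sat: list[dict] = []
row = {"hub_hash_key": "abc", "name": "Alice", "city": "Ghent"}
load_satellite(sat, row, ["name", "city"], date(2024, 1, 1))  # first load: new row
load_satellite(sat, row, ["name", "city"], date(2024, 1, 2))  # unchanged: no new row
load_satellite(sat, {**row, "city": "Leuven"}, ["name", "city"], date(2024, 1, 3))
print(len(sat))  # 2
```

Because the comparison happens per hub key on a single precomputed hash, many keys can be loaded independently, which is exactly where the parallelisation mentioned above pays off.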
With all data available in the Data Vault layer, the Information Layer adds end-user accessible objects such as (Kimball) Star Schemas, data/information marts and OLAP cubes. End users have data available in a format they can slice and dice, query, or traverse. The Data Vault objects are often denormalised in this transformation, and only the business-relevant historical information reaches this layer. It must be noted that a traditional Data Warehouse architecture is, in essence, built upon the Data Vault layer. As the objects in the information layer are built upon the full history, these information objects can also be rebuilt or changed without losing information. A possible use case could be to retroactively extend a dimension table with historically tracked information.
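As an illustration of rebuilding Information Layer objects from the tracked history, the sketch below derives a ‘current state’ dimension from full satellite history by keeping, per hub key, the most recently loaded row. The row structure and names are hypothetical; a slowly changing dimension could be derived from the same history in a similar way.

```python
def current_dimension(satellite_rows: list[dict]) -> list[dict]:
    """Derive a 'current state' dimension from full satellite history by
    keeping, per hub hash key, the row with the latest load date."""
    latest: dict[str, dict] = {}
    for row in satellite_rows:
        key = row["hub_hash_key"]
        if key not in latest or row["load_date"] > latest[key]["load_date"]:
            latest[key] = row
    return list(latest.values())


history = [
    {"hub_hash_key": "h1", "load_date": 1, "city": "Ghent"},
    {"hub_hash_key": "h1", "load_date": 2, "city": "Leuven"},
    {"hub_hash_key": "h2", "load_date": 1, "city": "Antwerp"},
]
print([r["city"] for r in current_dimension(history)])  # ['Leuven', 'Antwerp']
```

Because the full history stays in the satellites, this dimension can be dropped and rebuilt at any time, or replaced by a variant that picks the state valid at an arbitrary point in time.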
From the perspective of the business user, the interaction with the information does not change: business users should not be aware that a Data Vault is supporting their information objects. As in traditional architectures, the objects they interact with remain the same.
Data Vault adds a uniform data-historisation layer to your data architecture with a solid focus on identifying the correct business keys. Traditional Information Layer objects are still used to present the information to your business users but can be re-created with historically correct information. Hence, the true added value of Data Vault lies in supporting history on data in a uniform structure and, through this uniformity, in integrating data across applications. It is not uncommon that ‘similar’ data is housed in different applications; e.g. a customer defined in your invoicing application, in your marketing application and in your brand-loyalty application. Data Vault helps with creating an application-independent layer that facilitates changing business rules to define your ‘up-to-date customer’.
One must consider that adding this intermediary layer results in a very long data-processing chain (longer chains resulting in a higher chance of errors). The source data undergoes a lot of transformations and joins throughout this process. Often, an additional ‘Business Vault’ is also modelled on top of the Data Vault layer, which does incorporate the business logic and master data aspects (the original layer that does not incorporate business logic is then referred to as the Raw Data Vault). This minimises implementing the same rules in separate Information Layer objects. As this Business Vault is also modelled in Hubs, Links and Satellites, it is not a user-friendly point of access for inexperienced data analysts.
After creating the Data Vault and the Business Vault, the Information Layer objects still need to be created. It is not uncommon that transactional elements are loaded into Staging, then normalised in the Data Vault and then denormalised again to their original form in the Information Layer (e.g. ticket sales). While Information Layer objects could be rebuilt, such scenarios rarely occur, as the Information Layer has a lot of dependencies.
Does your organisation need Data Vault in its data landscape? The typical consultancy answer would be ‘it depends’. If historical tracking of your line of business is imperative in your Data Warehouse, lots of sources need to be integrated, and you are willing to invest in one or two newly modelled data layers, then we surely recommend Data Vault.
If you want to avoid the layered approach and shorten the time from extraction to insights, the following approaches can be evaluated:
- A ‘Data Lake’ layer storing each change happening at the source: it is not uncommon that each and every version is stored in a (cost-effective) data lake. Maintaining all this information requires far less modelling effort and can easily be implemented. On the other hand, business keys are not defined throughout your organisation, and future integrations may not be as easy as with Data Vault.
- An Operational Data Store which tracks information: an Operational Data Store (ODS) is a separate layer that presents data from source systems in third normal form. By automatically tracking the changes to this ODS, all history is captured. Technically, the ‘Temporal Table’ functionality of SQL Server could be used for databases, and the ‘Delta history’ could be used for Delta files in a Hive metastore-based ODS. While easy to implement and set up, this approach does not provide the same support for exhaustive changes in your source system (e.g. a one-to-many in your source system becoming a many-to-many).
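To illustrate what such automatic change tracking does conceptually, the sketch below mimics system-versioned (temporal) behaviour with a plain in-memory table: on every update, the open row for a key is closed and a new open-ended version is appended. This is a conceptual model only, not how SQL Server temporal tables or Delta history are implemented internally.

```python
from datetime import datetime


def upsert_with_history(table: list[dict], key: str,
                        record: dict, now: datetime) -> None:
    """Mimic a system-versioned table: close the currently open row for this
    key (set its valid_to) and insert a new open-ended version."""
    for row in table:
        if row[key] == record[key] and row["valid_to"] is None:
            row["valid_to"] = now  # close the previous version
    table.append({**record, "valid_from": now, "valid_to": None})


ods: list[dict] = []
upsert_with_history(ods, "customer_id",
                    {"customer_id": 1, "city": "Ghent"}, datetime(2024, 1, 1))
upsert_with_history(ods, "customer_id",
                    {"customer_id": 1, "city": "Leuven"}, datetime(2024, 2, 1))
print(len(ods))  # 2: one closed historical row, one current row
```

Every version of the record is retained with its validity interval, which is the history-capture part; what this approach cannot do, as noted above, is restructure relationships when the source model itself changes.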
Each approach has its own pros and cons. An experienced BI architect can guide you to the best applicable solution for your situation.
If you’re interested in the Data Vault methodology, we recommend the following books. These books address all aspects of Data Vault and have been a source of inspiration for modelling data objects.
- Building a Scalable Data Warehouse with Data Vault 2.0 – Daniel Linstedt & Michael Olschimke
- Super Charge Your Data Warehouse – Daniel Linstedt
Sander Allert is an experienced BI architect with a passion for following new trends. Sander is passionate about data in all of its aspects (Big Data, Data Science, Self-Service BI, Master Data, …) and loves to share his knowledge. If you need help with architectural decisions, do not hesitate to invite Sander over for a coffee to share some ideas.