Implementing Data Vault 2.0 on Fabric Data Warehouse
This article is authored by Michael Olschimke, co-founder and CEO at Scalefree International GmbH, and co-authored by @Trung_Ta, Senior BI Consultant at Scalefree.
The technical review was done by Ian Clarke and Naveed Hussain, GBBs (Cloud Scale Analytics) for EMEA at Microsoft.
Introduction
In the previous articles of this series, we discussed how to model Data Vault on Microsoft Fabric. Our initial focus was on the basic entity types, including hubs, links, and satellites, followed by advanced entity types, such as non-historized links and multi-active satellites. The third article built a more complete model, including a typical modeling process, for Microsoft Dynamics CRM data.
But the model is only a means to an end: our goal for the entire blog series is to build a data platform using Microsoft Fabric on the Azure cloud. And for that, we also have to load the source data into the target model. This is the topic of this article: how to load the Data Vault entities. For that reason, we continue with our discussion of the basic and advanced Data Vault entities. The Microsoft Dynamics CRM article should be considered an excursion to demonstrate what a more comprehensive model looks like and, equally important, how we get there.
Data Vault 2.0 Design Principles
Data Vault was developed by Dan Linstedt, its inventor and co-founder of Scalefree. It was designed to meet the challenging requirements of its first users in the U.S. government. These requirements led to certain design principles that are the basis for the features and characteristics of Data Vault today.
One requirement is to perform insert-only operations on the target database. The data platform team might be required to delete records for legal reasons (e.g., GDPR or HIPAA), but records are never updated. The advantage of this approach is twofold: inserts perform much faster than deletes or updates, and inserting records is more in line with the task at hand, which is to capture the source data and the changes to it. Updating records would mean losing the old version of a record.
Another feature of the loading patterns in Data Vault is the ability to load the data incrementally. There is no need at all to load the same record twice into the Raw Data Vault layer. In subsequent layers, such as the Business Vault, there is also no need to touch the same record again or to re-run a calculation on it. This does not hold in every case, but it does in most, and it requires diligent implementation practices. It is possible, though, and we have built many systems this way: in the best case, we touch data only if we haven’t touched it before.
The insert-only, incremental approach leads to full restartability: if only records that have not been processed yet are loaded in the next execution of the load procedure, why not partition the data into independent subsets and then load partition by partition? If something goes wrong, just restart the process: it will continue loading whatever is not in the target yet.
And if the partitions are based on independent subsets, the next step is to parallelize the loads on multiple distributed nodes. This requires independent processes, which means that there should be no dependencies between the individual loading procedures of the Raw Data Vault (and all the other layers). That leads to high-performance, scalable, distributed systems in the cloud that can process any data volume at any speed.
This design, and the modeling style of Data Vault, leads to many parallel jobs. On the other hand, due to the standardization of the model (all hubs, links, and satellites are based on simple, repeating patterns), it is possible to standardize the loading patterns as well. Essentially, not only is every hub modelled in a similar way, using a generation template, but the loading process is also based on templates. The same is true for links, satellites, and all special entity types.
The standardization and the sheer number of loading procedures to be produced then lead to the need (and the opportunity) for automation. There is no need to invent these tools: tools such as Vaultspeed are readily available for project teams to use and speed up their development. These tools also help the team maintain the generated processes at scale.
It should also not matter which technology is used to load the Raw Data Vault: some customers prefer SQL statements, while others prefer Python scripts or ETL tools such as SQL Server Integration Services (SSIS). The Data Vault concepts for loading the entities are tool-agnostic and can even be performed in real-time scenarios, for example with Python or C# code.
These requirements concern the implementation of the Data Vault loading procedures. Additional, more business-focused requirements were discussed in the introductory article of this blog series.
Identifying Records
Another design choice is made by the data modeller: how records should be identified in the Data Vault model. There are three commonly used options: sequences, hash keys, or business keys.
Sequences were used back in Data Vault 1.0 and remain an option today. However, they are not a desired one: all of our clients who use sequences want to get rid of them by migrating to Data Vault 2.0 with hash keys or business keys to identify records. There are many issues with the use of sequences in Data Vault models; one is the required loading order. Hubs must be loaded first because the sequences are used in links and hub satellites to refer to the hub’s business key. Only after the links are completely loaded can link satellites be loaded. This loading order is particularly an issue in real-time systems, where it leads to unnecessary waiting states.
However, the bigger issue is the implied requirement for lookups. In order to load links, the sequence for each business key must first be resolved by performing a lookup with the business key from the source against the hub, where the sequence is generated for each new business key loaded. This puts a lot of I/O pressure on the database at the disk level.
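To illustrate, the following is a minimal, hypothetical sketch of a sequence-based link load (all object and column names are invented for this example): every incoming record must be joined back to the hubs just to resolve the sequence numbers before it can be inserted.
-- Hypothetical sketch of a sequence-based (Data Vault 1.0 style) link load;
-- every referenced hub requires a lookup to resolve its surrogate sequence.
INSERT INTO dv1.link_store_employee (store_seq, employee_seq, load_datetime, record_source)
SELECT
    hs.store_seq,              -- resolved by lookup against the Store hub
    he.employee_seq,           -- resolved by lookup against the Employee hub
    src.load_datetime,
    src.record_source
FROM stg.store_employee src
JOIN dv1.hub_store    hs ON hs.store_id    = src.store_id      -- lookup 1
JOIN dv1.hub_employee he ON he.employee_id = src.employee_id;  -- lookup 2
Every such lookup is an additional read against the hub, and the link load cannot start before the hub load has finished.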
Business keys are another option, if the database engine supports joining on business keys (and especially multi-part business keys) efficiently. This is true for many distributed databases but not true for more traditional, relational database engines, where the third (and default) option is used: hash keys. Since it is possible to mix different database technologies (e.g., using a distributed database, such as Microsoft Fabric, in the Azure cloud and a Microsoft SQL Server on premise), one might end up in an unnecessarily complex environment where parts of the overall solution are identified by business keys and other parts are identified by hash keys.
Therefore, most clients opt for the use of hash keys in all databases, regardless of their actual capabilities. Hash keys offer consistent query performance across different databases and easy-to-formulate join conditions that don’t span many columns. This is the reason why we also decided to use hash keys in a previous article and throughout this blog series.
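As a simple illustration of the join conditions, consider a hypothetical hub identified by a two-part business key (the object names below are assumptions for this example, not entities defined later in this article):
-- With multi-part business keys, every key column must appear in the join condition
-- (hypothetical hub and link identified by company_code and product_number):
SELECT l.load_datetime
FROM DV.product_order_lnk l
JOIN DV.product_hub h
  ON  h.company_code   = l.company_code
  AND h.product_number = l.product_number;

-- With hash keys, the same join collapses to a single fixed-length CHAR(32) column:
SELECT l.load_datetime
FROM DV.product_order_lnk l
JOIN DV.product_hub h
  ON h.hk_product_hub = l.hk_product_hub;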
When using hash keys (or business keys), all Data Vault entities can be loaded in parallel, without any dependencies. Eventual consistency will be reached as soon as the data of one batch (or real-time message) has been fully processed and loaded into the Raw Data Vault. However, immediate consistency is no longer possible when the entities are loaded in parallel, which is the recommendation. It is certainly possible (but beyond the scope of this article) to guarantee the consistency of query results, which is sufficient for analytical (and most operational) use cases.
It is also recommended to use a hash difference, calculated from the satellite payload, to streamline delta detection in the loading procedures for Data Vault satellites. Satellites are delta-driven, i.e., only entirely new records, or records in which at least one attribute has changed, are loaded. To perform the delta detection, incoming records must be compared with the previous ones. Without a hash diff in place, this has to be done by comparing records column by column, which can have a negative impact on the performance of the loading processes. Therefore, it is highly recommended to perform change detection using the fixed-length hash difference.
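As a simplified sketch of this idea (using the store address example developed later in this article), the delta check collapses into a single comparison of two fixed-length values; without the hash diff, the commented-out predicates would have to compare every payload column individually. Note that the full satellite loading pattern shown later additionally restricts the comparison to the latest delta per hash key.
-- Simplified delta detection using the hash difference:
-- a record qualifies as a delta if its (hash key, hash diff) pair
-- is not yet present in the target satellite.
SELECT stg.*
FROM dbo.stage_store_address_crm stg
LEFT JOIN DV.store_address_crm_lroc0_sat sat
  ON  sat.hk_store_hub                  = stg.hk_store_hub
  AND sat.hd_store_address_crm_lroc_sat = stg.hd_store_address_crm_lroc_sat
WHERE sat.hk_store_hub IS NULL;
-- Without a hash diff, the join would need one predicate per payload column:
--   AND sat.address_street = stg.address_street
--   AND sat.postal_code    = stg.postal_code
--   AND sat.country        = stg.country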
Loading Raw Data Vault Entities
The next sections discuss how to load the Raw Data Vault entities, namely hubs, links, and satellites. We will keep the main focus on standard entities.
The following figure sketches a Data Vault model derived from a few tables of the CRM source system of a retail business.
The following sections will present the loading patterns for objects that are marked in the above figure.
Data Vault stage
Before talking about loading the actual Data Vault entities, we must first explore the Data Vault stage objects, where incoming data payloads are prepared for the following loading processes.
In Fabric, it is recommended to create Data Vault stages as views, to leverage caching in Fabric Data Warehouse. While the same stage view can be used to populate different target objects (hubs, links, satellites, …), the query behind it will be executed only once and its results will be automatically stored in the database’s cache, ready for access by subsequent loading procedures that read from the same stage. This technique practically eliminates the need to materialize stage objects, as in the traditional approach.
Data Vault stages also calculate hash key and hash difference values from business keys and descriptive attributes, respectively. It is important to note that within a stage object there may be more than one hash key, as well as more than one hash diff, when the stage object feeds data into multiple target hubs, links, and satellites, which is common practice.
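For example, a stage view over a store_employee source table (the view assumed by the link loading example later in this article; the source table name and its columns are assumptions) would carry three hash values: one hash key per referenced hub and one link hash key calculated from the combination of both business keys. Ghost records and NULL handling of the business keys are omitted here for brevity.
-- Hypothetical sketch of a multi-target stage: one source feeds hub Store,
-- hub Employee, and the link Store Employee
-- (ghost records and NULL handling omitted for brevity).
CREATE VIEW DV.stage_store_employee_crm AS
SELECT
    CURRENT_TIMESTAMP AS load_datetime,
    'CRM.store_employee' AS record_source,
    store_id,
    employee_id,
    CONVERT(CHAR(32), HASHBYTES('md5',
        COALESCE(CAST(store_id AS VARCHAR), '')
    ), 2) AS hk_store_hub,                          -- hash key for hub Store
    CONVERT(CHAR(32), HASHBYTES('md5',
        COALESCE(CAST(employee_id AS VARCHAR), '')
    ), 2) AS hk_employee_hub,                       -- hash key for hub Employee
    CONVERT(CHAR(32), HASHBYTES('md5',
        COALESCE(CAST(store_id AS VARCHAR), '') + '|' +
        COALESCE(CAST(employee_id AS VARCHAR), '')
    ), 2) AS hk_store_employee_lnk                  -- hash key for link Store Employee
FROM dbo.store_employee;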
Moreover, Data Vault stages prepare the insertion of so-called ghost records. These are artificially generated records added to Data Vault objects, which contain default/dummy values. To read more about ghost records and their usage, please visit: Implementing Ghost Records.
The example code script below creates a Data Vault stage view from the source table store_address. In the next step, this view will be used to load the hub Store and its satellite Store Address.
CREATE VIEW dbo.stage_store_address_crm AS
WITH src_data AS (
SELECT
CURRENT_TIMESTAMP AS load_datetime,
'CRM.store_address' AS record_source,
store_id,
address_street,
postal_code,
country
FROM dbo.store_address
),
hash AS (
SELECT
load_datetime,
record_source,
store_id,
address_street,
postal_code,
country,
CASE
WHEN store_id IS NULL THEN '00000000000000000000000000000000'
ELSE CONVERT(CHAR(32), HASHBYTES('md5',
COALESCE(CAST(store_id AS VARCHAR), '')
), 2)
END AS hk_store_hub,
CONVERT(CHAR(32), HASHBYTES('md5',
COALESCE(CAST(address_street AS VARCHAR), '') + '|' +
COALESCE(CAST(postal_code AS VARCHAR), '') + '|' +
COALESCE(CAST(country AS VARCHAR), '')
), 2) AS hd_store_address_crm_lroc_sat
FROM src_data
),
ghost_records AS (
SELECT
CONVERT(DATETIME, '1900-01-01T00:00:00', 126) AS load_datetime,
'SYSTEM' AS record_source,
'??????' AS store_id,
'(unknown)' AS address_street,
'(unknown)' AS postal_code,
'??' AS country,
'00000000000000000000000000000000' AS hk_store_hub,
'00000000000000000000000000000000' AS hd_store_address_crm_lroc_sat
UNION
SELECT
CONVERT(DATETIME, '1900-01-01T00:00:00', 126) AS load_datetime,
'SYSTEM' AS record_source,
'XXXXXX' AS store_id,
'(error)' AS address_street,
'(error)' AS postal_code,
'XX' AS country,
'ffffffffffffffffffffffffffffffff' AS hk_store_hub,
'ffffffffffffffffffffffffffffffff' AS hd_store_address_crm_lroc_sat
),
final_select AS (
SELECT
load_datetime,
record_source,
store_id,
address_street,
postal_code,
country,
hk_store_hub,
hd_store_address_crm_lroc_sat
FROM hash
UNION ALL
SELECT
load_datetime,
record_source,
store_id,
address_street,
postal_code,
country,
hk_store_hub,
hd_store_address_crm_lroc_sat
FROM ghost_records
)
SELECT *
FROM final_select
;
This rather lengthy statement prepares the data and adds the ghost records: the first CTE, src_data, selects the data from the staging table and adds the system attributes, such as the load date timestamp and the record source. The next CTE, hash, then adds the hash keys and hash differences required by the target model. Yes, this implies that the target model is already known, but we have seen in the previous article how to derive the target model from the staged data in a data-driven Data Vault design. Once that is done, the target model for the Raw Data Vault is known and we can add the required hash keys (for hubs and links) and hash diffs (for satellites) to the staged data. In Fabric, this is done only virtually – on other platforms, it might be necessary to actually add the hash values to the staging tables.
Another CTE, ghost_records, generates two records to be used as zero keys in hubs and links and as ghost records in satellites. This CTE populates the two records with default values for the descriptive attributes and the business keys. It is recommended to use default descriptions that one would expect to see for the unknown member in a dimension. The reason is that these two records will later be turned into two members of the dimension: the unknown member and the erroneous member.
The CTE final_select then unions the two datasets: the staging data provided by the CTE hash and the ghost records provided by the CTE ghost_records. The loading processes for the Raw Data Vault then use this dataset as their input.
Loading Hubs
The first loading pattern that we want to examine is the hub loading pattern. Since a hub contains a distinct list of business keys, a deduplication logic must be carried out during the hub loading process to eliminate duplicates of the same business key coming from the Data Vault stage view. In the code script below, we aim to load only the very first copy of each incoming hash key/business key into the target hub entity. This is guaranteed by the ROW_NUMBER() window function and the WHERE condition rno (row number) = 1 at the end of the loading script.
In addition, we also perform a forward-lookup check to verify that the incoming hash key does not yet exist in the target. Only then is it inserted into the hub entity. This is done in the WHERE condition: [hash key from stage] NOT IN (SELECT [hash key] FROM [target hub]).
The example code script below loads the hub Store with the business key Store ID:
WITH dedupe AS (
SELECT
hk_store_hub,
load_datetime,
record_source,
store_id,
ROW_NUMBER() OVER (PARTITION BY hk_store_hub ORDER BY load_datetime ASC) AS rno
FROM dbo.stage_store_address_crm
)
INSERT INTO DV.STORE_HUB
(
hk_store_hub,
load_datetime,
record_source,
store_id
)
SELECT
hk_store_hub,
load_datetime,
record_source,
store_id
FROM dedupe
WHERE rno = 1
AND hk_store_hub NOT IN (SELECT hk_store_hub FROM DV.STORE_HUB)
;
In the above statement, the CTE dedupe selects all business keys and adds a row number to select the first occurrence of each business key, including its record source and load date timestamp. Duplicates can exist for two reasons: either multiple batches exist in the staging area or the same business key appears multiple times in the same source within a single batch – for example, when a customer has purchased multiple products across multiple transactions in a retail store.
The CTE is then the input for the INSERT INTO statement into the hub entity. In the select from the CTE, a filter is applied to select only the first occurrence of the business key.
Loading Standard Links
The pattern is similar to hub loading – only the very first copy of each incoming link hash key is loaded into the target link entity. The link hash key is calculated from the combination of business keys of the hubs that are connected by the link entity.
The example code script below loads the link Store Employee with two Hub references from Hub Store and Hub Employee:
WITH dedupe AS (
SELECT
hk_store_employee_lnk,
hk_store_hub,
hk_employee_hub,
load_datetime,
record_source
FROM (
SELECT
hk_store_employee_lnk,
hk_store_hub,
hk_employee_hub,
load_datetime,
record_source,
ROW_NUMBER() OVER (PARTITION BY hk_store_employee_lnk ORDER BY load_datetime ASC) AS rno
FROM DV.stage_store_employee_crm
) s
WHERE rno = 1
)
INSERT INTO DV.store_employee_lnk
(
hk_store_employee_lnk,
hk_store_hub,
hk_employee_hub,
load_datetime,
record_source
)
SELECT
hk_store_employee_lnk,
hk_store_hub,
hk_employee_hub,
load_datetime,
record_source
FROM dedupe
WHERE hk_store_employee_lnk NOT IN (SELECT hk_store_employee_lnk FROM DV.store_employee_lnk)
;
Similar to the standard hub loading pattern, the standard link pattern starts with a deduplication step. Its goal is to insert only the very first occurrence of the combination of business keys referenced in the link relationship.
Then, the main INSERT INTO … SELECT statement also includes a forward lookup to the target link entity, to only insert unknown link hash keys into the Raw Data Vault.
Loading Non-Historized Links
In Data Vault 2.0, the recommended approach to model transactions, events, or, in general, non-changing data is to use non-historized link entities – a.k.a. transactional links, discussed in our previous article on advanced Data Vault modeling.
The example code script below loads a Non-Historized Link that contains transactions made in retail stores – with two Hub references from Hub Store and Hub Customer:
WITH high_water_marking AS (
SELECT
hk_store_transaction_nlnk,
hk_store_hub,
hk_customer_hub,
load_datetime,
record_source,
transaction_id,
amount,
transaction_date
FROM DV.stage_store_transactions_crm
WHERE load_datetime > (
SELECT COALESCE(MAX(load_datetime), DATEADD(s, -1, CONVERT(DATETIME, '1900-01-01T00:00:00', 126)))
FROM DV.store_transaction_nlnk
)
),
dedupe AS (
SELECT
hk_store_transaction_nlnk,
hk_store_hub,
hk_customer_hub,
load_datetime,
record_source,
transaction_id,
amount,
transaction_date
FROM (
SELECT
hk_store_transaction_nlnk,
hk_store_hub,
hk_customer_hub,
load_datetime,
record_source,
transaction_id,
amount,
transaction_date,
ROW_NUMBER() OVER (PARTITION BY hk_store_transaction_nlnk ORDER BY load_datetime ASC) AS rno
FROM high_water_marking
) s
WHERE rno = 1
)
INSERT INTO DV.store_transaction_nlnk
(
hk_store_transaction_nlnk,
hk_store_hub,
hk_customer_hub,
load_datetime,
record_source,
transaction_id,
amount,
transaction_date
)
SELECT
hk_store_transaction_nlnk,
hk_store_hub,
hk_customer_hub,
load_datetime,
record_source,
transaction_id,
amount,
transaction_date
FROM dedupe
WHERE hk_store_transaction_nlnk NOT IN (SELECT hk_store_transaction_nlnk FROM DV.store_transaction_nlnk)
;
The main difference between the loading patterns for standard links and non-historized links lies in the so-called high-water marking logic in the first CTE of the same name. This logic only lets records through from the DV stage object whose technical load_datetime occurs after the latest one found in the target link entity. This allows us to “skip” data records that have already been processed by previous DV loads, effectively reducing the workload on the data warehouse’s side.
Loading Standard Satellites
Now we turn to the more complicated loading patterns within a Data Vault 2.0 implementation: those for satellite entities. While querying from the stage, only records with a load date timestamp (LDTS) exceeding the latest load date in the target satellite are fetched for further processing.
Note that these queries are more advanced than other examples you might find elsewhere. The reason is that they are optimized for loading all data from the underlying data lake, even multiple batches at once, but in the right order. Especially for satellites, this poses a challenge, as Data Vault satellites are typically delta-driven in order to save storage and improve performance.
A few principles from the above loading patterns for hubs and links also apply to satellites, such as the deduplication logic. It eliminates hard duplicates (i.e., records with identical data attribute values) and, combined with the aforementioned filtering of old load date timestamps, reduces the amount of incoming data records.
WITH stg AS (
SELECT
hk_store_hub,
load_datetime,
record_source,
hd_store_address_crm_lroc_sat,
address_street,
postal_code,
country
FROM dbo.stage_store_address_crm
WHERE load_datetime > (
SELECT COALESCE(MAX(load_datetime), DATEADD(s, -1, CONVERT(DATETIME, '1900-01-01T00:00:00', 126)))
FROM DV.store_address_crm_lroc0_sat
)
),
dedupe_hash_diff AS (
SELECT
hk_store_hub,
load_datetime,
record_source,
hd_store_address_crm_lroc_sat,
address_street,
postal_code,
country
FROM (
SELECT
hk_store_hub,
load_datetime,
record_source,
hd_store_address_crm_lroc_sat,
COALESCE(LAG(hd_store_address_crm_lroc_sat) OVER (PARTITION BY hk_store_hub ORDER BY load_datetime), '') AS prev_hd,
address_street,
postal_code,
country
FROM stg
) s
WHERE hd_store_address_crm_lroc_sat != prev_hd
),
dedupe_hard_duplicate AS (
SELECT
hk_store_hub,
load_datetime,
record_source,
hd_store_address_crm_lroc_sat,
address_street,
postal_code,
country
FROM (
SELECT
hk_store_hub,
load_datetime,
record_source,
hd_store_address_crm_lroc_sat,
address_street,
postal_code,
country,
ROW_NUMBER() OVER(PARTITION BY hk_store_hub ORDER BY load_datetime DESC) AS rno
FROM dedupe_hash_diff
) dhd
WHERE rno = 1
),
latest_delta_in_target AS (
SELECT
hk_store_hub,
hd_store_address_crm_lroc_sat
FROM (
SELECT
hk_store_hub,
hd_store_address_crm_lroc_sat,
ROW_NUMBER() OVER(PARTITION BY hk_store_hub ORDER BY load_datetime DESC) AS rno
FROM
DV.store_address_crm_lroc0_sat
) s
WHERE rno = 1
)
INSERT INTO DV.store_address_crm_lroc0_sat
SELECT
hk_store_hub,
load_datetime,
record_source,
hd_store_address_crm_lroc_sat,
address_street,
postal_code,
country
FROM dedupe_hard_duplicate
WHERE NOT EXISTS (
SELECT 1
FROM latest_delta_in_target
WHERE latest_delta_in_target.hk_store_hub = dedupe_hard_duplicate.hk_store_hub
AND latest_delta_in_target.hd_store_address_crm_lroc_sat = dedupe_hard_duplicate.hd_store_address_crm_lroc_sat
)
;
The first CTE stg selects the batches from the staging table where the load date timestamp is not yet in the target satellite. Batches that are already processed in the past are ignored this way.
After that, the CTE dedupe_hash_diff removes non-deltas from the data flow: records that have not changed from the previous batch (identified by the load date timestamp) are removed from the dataset.
Next, the CTE dedupe_hard_duplicate removes those records from the dataset where hard duplicates exist. This statement assumes that a standard satellite is loaded, not a multi-active satellite.
The CTE latest_delta_in_target retrieves the latest delta for the hash key from the target satellite to perform the delta-check against the target.
Finally, the INSERT INTO statement selects the changed or new records from the sequence of CTEs and inserts them into the target satellite.
Calculating the Satellite’s End-Date
Typically, a Data Vault satellite contains not only a load date but also a load end date. However, the drawback of a physical load end date is that it requires an update on the satellite, performed by an end-dating process after more data has been loaded into the satellite.
This update is not desired. Nowadays, the alternative approach is to calculate the load end date virtually in a view on top of the satellite’s table. This view provides the same structure (all the attributes) as the underlying table and, in addition, the load end date, which is calculated using a window function. In the example below, the physical satellite table carries the suffix lroc0_sat, while the end-dated view exposes the satellite under the name lroc_sat.
CREATE VIEW DV.store_address_crm_lroc_sat AS
WITH enddating AS (
SELECT
hk_store_hub,
load_datetime,
COALESCE(
LEAD(DATEADD(ms, -1, load_datetime)) OVER (PARTITION BY hk_store_hub ORDER BY load_datetime),
CONVERT(DATETIME, '9999-12-31T23:59:59', 126)
) AS load_end_datetime,
record_source,
hd_store_address_crm_lroc_sat,
address_street,
postal_code,
country
FROM DV.store_address_crm_lroc0_sat
)
SELECT
hk_store_hub,
load_datetime,
load_end_datetime,
record_source,
CASE WHEN load_end_datetime = CONVERT(DATETIME, '9999-12-31T23:59:59', 126)
THEN 1
ELSE 0
END AS is_current,
hd_store_address_crm_lroc_sat,
address_street,
postal_code,
country
FROM enddating
;
In the CTE enddating, the load end date is calculated using the LEAD function. Other than that, the CTE selects all columns from the underlying table.
The view’s SELECT statement then calculates an is_current flag based on the load end date, which often comes in handy.
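For example, a consumer query that only needs the current address of each store can simply filter on this flag (or, equivalently, on the open-ended load end date):
-- Current address per store, as exposed by the end-dated satellite view.
SELECT
    hk_store_hub,
    address_street,
    postal_code,
    country,
    load_datetime
FROM DV.store_address_crm_lroc_sat
WHERE is_current = 1;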
Outlook and conclusion
This concludes our article on loading the standard entities of the Raw Data Vault. These patterns provide the foundation for the loading patterns of the advanced Data Vault entities, such as non-historized links or multi-active satellites, which require only minor modifications and which we typically discuss in our blog at https://www.scalefree.com/blog/ and in our training. In a subsequent article, we will demonstrate how to automate these patterns using Vaultspeed to improve the productivity (and therefore the agility) of your team. Before we get there, however, we will continue our journey downstream in the architecture of the data platform and discuss in our next article how to implement business rules in the Business Vault.
About the Authors
Michael Olschimke is co-founder and CEO at Scalefree International GmbH, a Big Data consulting firm in Europe. The firm empowers clients across all industries to take advantage of Data Vault 2.0 and similar Big Data solutions. Michael has trained thousands of data warehousing individuals from the industry, taught classes in academia, and published on these topics regularly.
Trung Ta is a senior BI consultant at Scalefree International GmbH. With over 7 years of experience in data warehousing and BI, he has been advising Scalefree’s clients in different industries (banking, insurance, government,…) and of various sizes in establishing and maintaining their data architectures. Trung’s expertise lies within Data Vault 2.0 architecture, modeling, and implementation, with a specific focus on data automation tools.