Data-Vault-Model

Scalable Data Storage with Historization and Revision Capabilities

July 22, 2024 - Tommy Dräger

The DataVaultModel (DVM) is my favorite when it comes to database management. It's incredibly scalable and perfect for handling the constant changes customers demand in structure. Sure, the initial setup can be a bit overwhelming, but once you get the hang of it, the structure and rules make everything run smoothly. In this article, I'll share why DVM could be the right choice for you, especially if your project is requiring full historization and revision capabilities.

In this section, we will take a large First Normal Form (1NF) table and logically transform it into a Data Vault model structure. This includes breaking down the large table into smaller, more manageable pieces (dont worry we explain those later):

hubs (b_ "base tables")
links (c_ "connection tables")
satellites (d_ "detail tables")

We will also add crucial metadata fields to maintain historical data, support multi-tenancy, and track data version.

Original 1NF Table

Let's take this step-by-step with a real-world example to understand how to break down a large 1NF table into the Data Vault model. We'll use a retail scenario involving customers, orders, and products.

TransactionID	CustomerName	CustomerEmail	OrderDate	ProductName	ProductCategory	Quantity	Price
1	John Doe	[email protected]	2024-03-12	Laptop	Electronics	1	1200
2	Jane Smith	[email protected]	2024-03-13	Smartphone	Electronics	2	800
3	John Doe	[email protected]	2024-03-12	Mouse	Electronics	1	20
4	John Doe	[email protected]	2024-03-12	CPU	Hardware	2	800

Step 1: Identifying Base Hub Tables (b_)

The first step is to look for data classes that could be split apart. In our case we have cases where we see redundant data like "John Doe" or "Electronics". So a first instinct would be to break off the Customer and Products. The goal of this practise is to find a case where no redundancy will ever occur. So if you think you just need a seperate customer table or product table, then think again! Because Product Category as well as Prices are data that will have duplicates over and over again.

Info:

_bid (bussines ID) a unique number for a entity, Example: 12
_sid (source Id) a unique string for a entity, Example "cust_001"

Tasks

Take your time to break this top table into smaller tables
give each table the name {b_example}
give every b_ 2 columns (example_sid (VARCHAR) & example_bid (INTEGER))
Look at the solutions to check

Step 1: Solutions

b_customer

customer_bid	customer_sid
1	cust_001
2	cust_002

b_product

product_bid	product_sid
1	prod_001
2	prod_002
3	prod_003
4	prod_004

b_category

category_bid	category_sid
1	cat_001
2	cat_002

b_price

price_bid	price_sid
1	pri_001
2	pri_002
3	pri_003

b_order

order_bid	order_sid
1	order_001
2	order_002
3	order_003
4	order_004

b_transaction

transaction_bid	transaction_sid
1	trans_001
2	trans_002
3	trans_003
4	trans_004

Step 2: Creating Details Sattelite Tables (d_)

Now, we associate additional descriptive details (such as email addresses, product descriptions, or customer addresses) with the corresponding hubs.

create a new table call it d_example
add a column example_bid
add details that are connected with that entity (things like name, description, other _bids)
keep the naming consistent (not "email" but "customer_email" to show the association)
name external keys (not "product_category_bid" but "category_bid")
look at the solutions below

d_customer

customer_bid	customer_name	customer_email
1	John Doe	[email protected]
2	Jane Smith	[email protected]

d_product

product_bid	product_name	category_bid	price_bid
1	Laptop	1	1
2	Smartphone	1	2
3	Mouse	1	3
4	CPU	2	2

d_price

notice that this is the true price. cause the 1NF table also considered the quantity to calculate the final price

price_bid	price_amount
1	1200
2	400
3	20

d_order

order_bid	order_date
1	2024-03-12
2	2024-03-13
3	2024-03-12
4	2024-03-12

d_category

category_bid	category_name
1	Electronics
2	Hardware

d_transaction

transaction_bid	transaction_quantity
1	1
2	2
3	1
4	2

Step 3: Creating Connections Link Tables (c_)

This step involves creating tables that define the relationships between different entities, such as which customer made which order or which products are included in which transactions. A rule of thumb is to think about m:n or 1:m relationship. Example: 1 Transaction can have multiple Products. 1 Customer can make multiple Order

create a new table c_example
add two _bid keys that are connection A_bid and B_bid
finally add a keys that is a combination of both _sids A_sid + B_sid
use something like "__" or "##" "." to seperate the sids

c_transaction_product

transaction_bid	product_bid	transaction_product_sid
1	1	trans_001__prod_001
2	2	trans_002__prod_002
3	3	trans_003__prod_003
4	4	trans_004__prod_004

c_customer_order

customer_bid	order_bid	customer_order_sid
1	1	cust_001__order_001
2	2	cust_002__order_002
1	3	cust_001__order_003
1	4	cust_001__order_004

Step 4: Adding Historization, Version, Multitenants etc

In this step, we extend our Data Vault model with additional metadata fields to enhance data governance, support temporal data management, enable multi-tenancy, and track data changes and origins. This expansion is crucial for ensuring robust, scalable, and secure data architectures.

Historization and Versioning

Historization refers to the process of keeping historical data within the database, allowing users to query the database as it was at any point in time. This is crucial for audits, trend analysis, and compliance with various regulations. Versioning works alongside historization to keep track of changes over time, providing insights into data evolution.

Multitenancy

Multitenancy allows multiple users or groups (tenants) to use the same instance of a software application, with data security and isolation. In databases, this is often achieved by including tenant identifiers in the data model, ensuring that data is partitioned and accessible only by the authorized tenant.

Fields Explained

table_rid	A unique record identifier for each entry in the table, typically auto-incremented.
table_rid_last	The identifier (table_rid) of the previous version of the record.
table_rid_next	The identifier (table_rid) of the subsequent version of the record.
valid_dt_from	The timestamp marking when the record becomes valid.
valid_dt_to	The timestamp marking when the record ceases to be valid.
tenant_aid	The identifier for the tenant, supporting multi-tenancy.
dbuser_aid	The identifier for the database user who made the last change.
interface_aid	Identifies the platform through which the data was entered or last updated.
audit_sid	A string identifier linking to an audit trail of changes.

customer_bid	customer_name	customer_email	table_rid	table_rid_last	table_rid_next	valid_dt_from	valid_dt_to	tenant_aid	dbuser_aid	interface_aid	audit_sid
1	John Doe	[email protected]	1	NULL	NULL	2024-01-01 00:00:00.000000+00	9999-12-31 23:59:59.999999+00	1	42	3	'audit123'
2	Jane Smith	[email protected]	2	NULL	NULL	2024-01-02 00:00:00.000000+00	9999-12-31 23:59:59.999999+00	1	42	3	'audit456'

add those 9 columns to every table with no exception

Next Steps: writing vd_VIEWS

the next step would be to create the first level of Views, this views task is it to select the most recent data entry of b_table and join it with most recent data entry if a d_table.

create a new view and name vd_example
select b_example
join d_example
make sure to only select the most recent base entry
make sure to only select the most recent detail entry from the join

Example:

CREATE OR REPLACE VIEW core.vd_customer AS  
SELECT      hub.customer_bid
,           hub.customer_sid
,           sat.customer_email
,           sat.customer_name
,           hub.tenant_aid
FROM        core.b_customer hub
JOIN        core.d_customer sat
ON          sat.customer_bid = hub.customer_bid
WHERE       CURRENT_TIMESTAMP >= hub.valid_dt_from 
AND         CURRENT_TIMESTAMP <= hub.valid_dt_to 
AND         CURRENT_TIMESTAMP >= sat.valid_dt_from 
AND         CURRENT_TIMESTAMP <= sat.valid_dt_to
AND         sat.table_rid_next IS NULL
ORDER BY    hub.customer_bid;

customer_bid	customer_sid	customer_name	customer_email
1	cust_001	John Doe	[email protected]
2	cust_002	Jane Smith	[email protected]

Last Steps: writing 1NF VIEWS

the last step is it to create the user Views. If you look back at the Original 1NF Table from the top. Its now time to write a query to create View that resembles exactly the same columns that was given in the beginning.

create a new view and name it pr_example
select and join ONLY vdVIEWS! never join c, b or d tables directly
rename certain columns like AS (category_bid >> ProductCategory)

Conclusion

The Data Vault Model is a way of organizing your database that's great for keeping track of changes over time and adapting to new requirements easily. It breaks down big tables into smaller, more manageable parts, which is especially useful for big companies or ones that handle a lot of data.

Compared to other methods like Inmon's and Kimball's, Data Vault is particularly good when you need to keep a detailed history of your data, like in banking or insurance companies. This is because it's built to handle changes smoothly — like updating or deleting data — without losing any historical information.

In short, if you're dealing with lots of data and expect changes or need clear records, Data Vault could be the right choice. It's like having a detailed, adaptable archive that grows with your business.

Data-Vault-Model

Scalable Data Storage with Historization and Revision Capabilities

July 22, 2024 - Tommy Dräger

Original 1NF Table

Step 1: Identifying Base Hub Tables (b_)

Tasks

Step 1: Solutions

b_customer

b_product

b_category

b_price

b_order

b_transaction

Step 2: Creating Details Sattelite Tables (d_)

d_customer

d_product

d_price

d_order

d_category

d_transaction

Step 3: Creating Connections Link Tables (c_)

c_transaction_product

c_customer_order

Step 4: Adding Historization, Version, Multitenants etc

Historization and Versioning

Multitenancy

Fields Explained

Next Steps: writing vd_VIEWS

Example:

Last Steps: writing 1NF VIEWS

Conclusion

2025-07-06

Cosine Palette Designer

2024-12-31

Adlib Editor

2024-12-21

VGA-Mode 03h

2024-12-17

PC-Speaker 61h

2024-07-22

Data-Vault-Model

2024-07-22

Pacific League Major

2024-07-20

VGA-Mode 13h