Data-Vault-Model
Scalable Data Storage with Historization and Revision Capabilities
July 22, 2024 - Tommy Dräger
The DataVaultModel (DVM) is my favorite when it comes to database management. It's incredibly scalable and perfect for handling the constant changes customers demand in structure. Sure, the initial setup can be a bit overwhelming, but once you get the hang of it, the structure and rules make everything run smoothly. In this article, I'll share why DVM could be the right choice for you, especially if your project is requiring full historization and revision capabilities.
In this section, we will take a large First Normal Form (1NF) table and logically transform it into a Data Vault model structure. This includes breaking down the large table into smaller, more manageable pieces:
- hubs (b_ "base tables")
- links (c_ "connection tables")
- satellites (d_ "detail tables")
We will also add crucial metadata fields to maintain historical data, support multi-tenancy, and track data version.
Original 1NF Table
Let's take this step-by-step with a real-world example to understand how to break down a large 1NF table into the Data Vault model. We'll use a retail scenario involving customers, orders, and products.
TransactionID |
CustomerName |
CustomerEmail |
OrderDate |
ProductName |
ProductCategory |
Quantity |
Price |
1 |
John Doe |
[email protected] |
2024-03-12 |
Laptop |
Electronics |
1 |
1200 |
2 |
Jane Smith |
[email protected] |
2024-03-13 |
Smartphone |
Electronics |
2 |
800 |
3 |
John Doe |
[email protected] |
2024-03-12 |
Mouse |
Electronics |
1 |
20 |
4 |
John Doe |
[email protected] |
2024-03-12 |
CPU |
Hardware |
2 |
800 |
Step 1: Identifying Base Hub Tables (b_)
The first step is to look for data classes that could be. In our case we have cases where we see redundant data like "John Doe" or "Electronics". So a first instinct would be to break off the Customer and Products. The goal of this practise is to find a case where no redundancy will ever occur. So if you think you just need a seperate customer table or product table, then think again! Because Product Category as well as Prices are data that will have duplicates over and over again.
Info:
- _bid (bussines ID) a unique number for a entity, Example: 12
- _sid (source Id) a unique string for a entity, Example "cust_001"
Tasks
- Take your time to break this top table into smaller tables
- give each table the name {b_example}
- give every b_ 2 columns (example_sid (VARCHAR) & example_bid (INTEGER))
- Look at the solutions to check
Step 1: Solutions
b_customer
customer_bid |
customer_sid |
1 |
cust_001 |
2 |
cust_002 |
b_product
product_bid |
product_sid |
1 |
prod_001 |
2 |
prod_002 |
3 |
prod_003 |
4 |
prod_004 |
b_category
category_bid |
category_sid |
1 |
cat_001 |
2 |
cat_002 |
b_price
price_bid |
price_sid |
1 |
pri_001 |
2 |
pri_002 |
3 |
pri_003 |
b_order
order_bid |
order_sid |
1 |
order_001 |
2 |
order_002 |
3 |
order_003 |
4 |
order_004 |
b_transaction
transaction_bid |
transaction_sid |
1 |
trans_001 |
2 |
trans_002 |
3 |
trans_003 |
4 |
trans_004 |
Step 2: Creating Details Sattelite Tables (d_)
Now, we associate additional descriptive details (such as email addresses, product descriptions, or customer addresses) with the corresponding hubs.
- create a new table call it d_example
- add a column example_bid
- add details that are connected with that entity (things like name, description, other _bids)
- keep the naming consistent (not "email" but "customer_email" to show the association)
- name external keys (not "product_category_bid" but "category_bid")
- look at the solutions below
d_customer
d_product
product_bid |
product_name |
category_bid |
price_bid |
1 |
Laptop |
1 |
1 |
2 |
Smartphone |
1 |
2 |
3 |
Mouse |
1 |
3 |
4 |
CPU |
2 |
2 |
d_price
notice that this is the true price. cause the 1NF table also considered the quantity to calculate the final price
price_bid |
price_amount |
1 |
1200 |
2 |
400 |
3 |
20 |
d_order
order_bid |
order_date |
1 |
2024-03-12 |
2 |
2024-03-13 |
3 |
2024-03-12 |
4 |
2024-03-12 |
d_category
category_bid |
category_name |
1 |
Electronics |
2 |
Hardware |
d_transaction
transaction_bid |
transaction_quantity |
1 |
1 |
2 |
2 |
3 |
1 |
4 |
2 |
Step 3: Creating Connections Link Tables (c_)
This step involves creating tables that define the relationships between different entities, such as which customer made which order or which products are included in which transactions. A rule of thumb is to think about m:n or 1:m relationship. Example: 1 Transaction can have multiple Products. 1 Customer can make multiple Order
- create a new table c_example
- add two _bid keys that are connection A_bid and B_bid
- finally add a keys that is a combination of both _sids A_sid + B_sid
- use something like "__" or "##" "." to seperate the sids
c_transaction_product
transaction_bid |
product_bid |
transaction_product_sid |
1 |
1 |
trans_001__prod_001 |
2 |
2 |
trans_002__prod_002 |
3 |
3 |
trans_003__prod_003 |
4 |
4 |
trans_004__prod_004 |
c_customer_order
customer_bid |
order_bid |
customer_order_sid |
1 |
1 |
cust_001__order_001 |
2 |
2 |
cust_002__order_002 |
1 |
3 |
cust_001__order_003 |
1 |
4 |
cust_001__order_004 |
Step 4: Adding Historization, Version, Multitenants etc
In this step, we extend our Data Vault model with additional metadata fields to enhance data governance, support temporal data management, enable multi-tenancy, and track data changes and origins. This expansion is crucial for ensuring robust, scalable, and secure data architectures.
Historization and Versioning
Historization refers to the process of keeping historical data within the database, allowing users to query the database as it was at any point in time. This is crucial for audits, trend analysis, and compliance with various regulations. Versioning works alongside historization to keep track of changes over time, providing insights into data evolution.
Multitenancy
Multitenancy allows multiple users or groups (tenants) to use the same instance of a software application, with data security and isolation. In databases, this is often achieved by including tenant identifiers in the data model, ensuring that data is partitioned and accessible only by the authorized tenant.
Fields Explained
table_rid |
A unique record identifier for each entry in the table, typically auto-incremented. |
table_rid_last |
The identifier (table_rid) of the previous version of the record. |
table_rid_next |
The identifier (table_rid) of the subsequent version of the record. |
valid_dt_from |
The timestamp marking when the record becomes valid. |
valid_dt_to |
The timestamp marking when the record ceases to be valid. |
tenant_aid |
The identifier for the tenant, supporting multi-tenancy. |
dbuser_aid |
The identifier for the database user who made the last change. |
interface_aid |
Identifies the platform through which the data was entered or last updated. |
audit_sid |
A string identifier linking to an audit trail of changes. |
customer_bid |
customer_name |
customer_email |
table_rid |
table_rid_last |
table_rid_next |
valid_dt_from |
valid_dt_to |
tenant_aid |
dbuser_aid |
interface_aid |
audit_sid |
1 |
John Doe |
[email protected] |
1 |
NULL |
NULL |
2024-01-01 00:00:00.000000+00 |
9999-12-31 23:59:59.999999+00 |
1 |
42 |
3 |
'audit123' |
2 |
Jane Smith |
[email protected] |
2 |
NULL |
NULL |
2024-01-02 00:00:00.000000+00 |
9999-12-31 23:59:59.999999+00 |
1 |
42 |
3 |
'audit456' |
- add those 9 columns to every table with no exception
Next Steps: writing vd_VIEWS
the next step would be to create the first level of Views, this views task is it to select the most recent data entry of b_table and join it with most recent data entry if a d_table.
- create a new view and name vd_example
- select b_example
- join d_example
- make sure to only select the most recent base entry
- make sure to only select the most recent detail entry from the join
Example:
CREATE OR REPLACE VIEW core.vd_customer AS
SELECT hub.customer_bid
, hub.customer_sid
, sat.customer_email
, sat.customer_name
, hub.tenant_aid
FROM core.b_customer hub
JOIN core.d_customer sat
ON sat.customer_bid = hub.customer_bid
WHERE CURRENT_TIMESTAMP >= hub.valid_dt_from
AND CURRENT_TIMESTAMP <= hub.valid_dt_to
AND CURRENT_TIMESTAMP >= sat.valid_dt_from
AND CURRENT_TIMESTAMP <= sat.valid_dt_to
AND sat.table_rid_next IS NULL
ORDER BY hub.customer_bid;
Last Steps: writing 1NF VIEWS
the last step is it to create the user Views. If you look back at the Original 1NF Table from the top. Its now time to write a query to create View that resembles exactly the same columns that was given in the beginning.
- create a new view and name it pr_example
- select and join ONLY vdVIEWS! never join c, b or d tables directly
- rename certain columns like AS (category_bid >> ProductCategory)
Conclusion
The Data Vault Model is a way of organizing your database that's great for keeping track of changes over time and adapting to new requirements easily. It breaks down big tables into smaller, more manageable parts, which is especially useful for big companies or ones that handle a lot of data.
Compared to other methods like Inmon's and Kimball's, Data Vault is particularly good when you need to keep a detailed history of your data, like in banking or insurance companies. This is because it's built to handle changes smoothly — like updating or deleting data — without losing any historical information.
In short, if you're dealing with lots of data and expect changes or need clear records, Data Vault could be the right choice. It's like having a detailed, adaptable archive that grows with your business.