Data-Vault-Model


Scalable Data Storage with Historization and Revision Capabilities


July 22, 2024 - Tommy Dräger


The DataVaultModel (DVM) is my favorite when it comes to database management. It's incredibly scalable and perfect for handling the constant changes customers demand in structure. Sure, the initial setup can be a bit overwhelming, but once you get the hang of it, the structure and rules make everything run smoothly. In this article, I'll share why DVM could be the right choice for you, especially if your project is requiring full historization and revision capabilities.

In this section, we will take a large First Normal Form (1NF) table and logically transform it into a Data Vault model structure. This includes breaking down the large table into smaller, more manageable pieces:

  • hubs (b_ "base tables")
  • links (c_ "connection tables")
  • satellites (d_ "detail tables")

We will also add crucial metadata fields to maintain historical data, support multi-tenancy, and track data version.

Original 1NF Table

Let's take this step-by-step with a real-world example to understand how to break down a large 1NF table into the Data Vault model. We'll use a retail scenario involving customers, orders, and products.

TransactionID CustomerName CustomerEmail OrderDate ProductName ProductCategory Quantity Price
1 John Doe [email protected] 2024-03-12 Laptop Electronics 1 1200
2 Jane Smith [email protected] 2024-03-13 Smartphone Electronics 2 800
3 John Doe [email protected] 2024-03-12 Mouse Electronics 1 20
4 John Doe [email protected] 2024-03-12 CPU Hardware 2 800

Step 1: Identifying Base Hub Tables (b_)

The first step is to look for data classes that could be. In our case we have cases where we see redundant data like "John Doe" or "Electronics". So a first instinct would be to break off the Customer and Products. The goal of this practise is to find a case where no redundancy will ever occur. So if you think you just need a seperate customer table or product table, then think again! Because Product Category as well as Prices are data that will have duplicates over and over again.

Info:

  • _bid (bussines ID) a unique number for a entity, Example: 12
  • _sid (source Id) a unique string for a entity, Example "cust_001"

Tasks

  • Take your time to break this top table into smaller tables
  • give each table the name {b_example}
  • give every b_ 2 columns (example_sid (VARCHAR) & example_bid (INTEGER))
  • Look at the solutions to check

Step 1: Solutions

b_customer

customer_bid customer_sid
1 cust_001
2 cust_002

b_product

product_bid product_sid
1 prod_001
2 prod_002
3 prod_003
4 prod_004

b_category

category_bid category_sid
1 cat_001
2 cat_002

b_price

price_bid price_sid
1 pri_001
2 pri_002
3 pri_003

b_order

order_bid order_sid
1 order_001
2 order_002
3 order_003
4 order_004

b_transaction

transaction_bid transaction_sid
1 trans_001
2 trans_002
3 trans_003
4 trans_004

Step 2: Creating Details Sattelite Tables (d_)

Now, we associate additional descriptive details (such as email addresses, product descriptions, or customer addresses) with the corresponding hubs.

  • create a new table call it d_example
  • add a column example_bid
  • add details that are connected with that entity (things like name, description, other _bids)
  • keep the naming consistent (not "email" but "customer_email" to show the association)
  • name external keys (not "product_category_bid" but "category_bid")
  • look at the solutions below

d_customer

customer_bid customer_name customer_email
1 John Doe [email protected]
2 Jane Smith [email protected]

d_product

product_bid product_name category_bid price_bid
1 Laptop 1 1
2 Smartphone 1 2
3 Mouse 1 3
4 CPU 2 2

d_price

notice that this is the true price. cause the 1NF table also considered the quantity to calculate the final price

price_bid price_amount
1 1200
2 400
3 20

d_order

order_bid order_date
1 2024-03-12
2 2024-03-13
3 2024-03-12
4 2024-03-12

d_category

category_bid category_name
1 Electronics
2 Hardware

d_transaction

transaction_bid transaction_quantity
1 1
2 2
3 1
4 2

This step involves creating tables that define the relationships between different entities, such as which customer made which order or which products are included in which transactions. A rule of thumb is to think about m:n or 1:m relationship. Example: 1 Transaction can have multiple Products. 1 Customer can make multiple Order

  • create a new table c_example
  • add two _bid keys that are connection A_bid and B_bid
  • finally add a keys that is a combination of both _sids A_sid + B_sid
  • use something like "__" or "##" "." to seperate the sids

c_transaction_product

transaction_bid product_bid transaction_product_sid
1 1 trans_001__prod_001
2 2 trans_002__prod_002
3 3 trans_003__prod_003
4 4 trans_004__prod_004

c_customer_order

customer_bid order_bid customer_order_sid
1 1 cust_001__order_001
2 2 cust_002__order_002
1 3 cust_001__order_003
1 4 cust_001__order_004

Step 4: Adding Historization, Version, Multitenants etc

In this step, we extend our Data Vault model with additional metadata fields to enhance data governance, support temporal data management, enable multi-tenancy, and track data changes and origins. This expansion is crucial for ensuring robust, scalable, and secure data architectures.

Historization and Versioning

Historization refers to the process of keeping historical data within the database, allowing users to query the database as it was at any point in time. This is crucial for audits, trend analysis, and compliance with various regulations. Versioning works alongside historization to keep track of changes over time, providing insights into data evolution.

Multitenancy

Multitenancy allows multiple users or groups (tenants) to use the same instance of a software application, with data security and isolation. In databases, this is often achieved by including tenant identifiers in the data model, ensuring that data is partitioned and accessible only by the authorized tenant.

Fields Explained

table_rid A unique record identifier for each entry in the table, typically auto-incremented.
table_rid_last The identifier (table_rid) of the previous version of the record.
table_rid_next The identifier (table_rid) of the subsequent version of the record.
valid_dt_from The timestamp marking when the record becomes valid.
valid_dt_to The timestamp marking when the record ceases to be valid.
tenant_aid The identifier for the tenant, supporting multi-tenancy.
dbuser_aid The identifier for the database user who made the last change.
interface_aid Identifies the platform through which the data was entered or last updated.
audit_sid A string identifier linking to an audit trail of changes.
customer_bid customer_name customer_email table_rid table_rid_last table_rid_next valid_dt_from valid_dt_to tenant_aid dbuser_aid interface_aid audit_sid
1 John Doe [email protected] 1 NULL NULL 2024-01-01 00:00:00.000000+00 9999-12-31 23:59:59.999999+00 1 42 3 'audit123'
2 Jane Smith [email protected] 2 NULL NULL 2024-01-02 00:00:00.000000+00 9999-12-31 23:59:59.999999+00 1 42 3 'audit456'
  • add those 9 columns to every table with no exception

Next Steps: writing vd_VIEWS

the next step would be to create the first level of Views, this views task is it to select the most recent data entry of b_table and join it with most recent data entry if a d_table.

  • create a new view and name vd_example
  • select b_example
  • join d_example
  • make sure to only select the most recent base entry
  • make sure to only select the most recent detail entry from the join

Example:

CREATE OR REPLACE VIEW core.vd_customer AS  
SELECT      hub.customer_bid
,           hub.customer_sid
,           sat.customer_email
,           sat.customer_name
,           hub.tenant_aid
FROM        core.b_customer hub
JOIN        core.d_customer sat
ON          sat.customer_bid = hub.customer_bid
WHERE       CURRENT_TIMESTAMP >= hub.valid_dt_from 
AND         CURRENT_TIMESTAMP <= hub.valid_dt_to 
AND         CURRENT_TIMESTAMP >= sat.valid_dt_from 
AND         CURRENT_TIMESTAMP <= sat.valid_dt_to
AND         sat.table_rid_next IS NULL
ORDER BY    hub.customer_bid;
customer_bid customer_sid customer_name customer_email
1 cust_001 John Doe [email protected]
2 cust_002 Jane Smith [email protected]

Last Steps: writing 1NF VIEWS

the last step is it to create the user Views. If you look back at the Original 1NF Table from the top. Its now time to write a query to create View that resembles exactly the same columns that was given in the beginning.

  • create a new view and name it pr_example
  • select and join ONLY vdVIEWS! never join c, b or d tables directly
  • rename certain columns like AS (category_bid >> ProductCategory)

Conclusion

The Data Vault Model is a way of organizing your database that's great for keeping track of changes over time and adapting to new requirements easily. It breaks down big tables into smaller, more manageable parts, which is especially useful for big companies or ones that handle a lot of data.

Compared to other methods like Inmon's and Kimball's, Data Vault is particularly good when you need to keep a detailed history of your data, like in banking or insurance companies. This is because it's built to handle changes smoothly — like updating or deleting data — without losing any historical information.

In short, if you're dealing with lots of data and expect changes or need clear records, Data Vault could be the right choice. It's like having a detailed, adaptable archive that grows with your business.