DataJoinery's Data Vault Key Concepts

Entities

Key concepts: people, places, or things.

Hub

A key concept that has a set of defined attributes.

Hub & Link

A key concept with relevant attributes about the entity itself, as well as its key relationships with other entities.

Dependent-child Link

A key entity that does not exist without the instance of its parent entity. For example, an Order Item is a DCL, but it is meaningless without an instance of the associated Order information.

Relationships

Relationships indicate inter-relationships between key entities. Relationships have a direction. The relationship direction can be one-to-one, one-to-many, zero-to-many, many-to-many.

Link (Default)

A standard link that can contain a relationship between two or more related entities.

Derived (Exploration) Link

A link that is derived from the data, not via database referential integrity links (i.e. Primary Keys and Foreign Keys).

Hierarchical Link

A link that represents a hierarchy between one entity to itself (self-referential), or two separate entities.

Same-as Link

A link that indicates that two instances are the same thing. It can be defined as a link between two instances of the same entity, or a link of two instances of separate entities.

Satellites

Satellites contain the descriptive properties of an entity or relationship. Satellites are loaded according to data feeds, from various sources of data. The attributes can be split across three aspects:

  • Rate-of-change: how often the attribute changes, i.e. High, Medium, or Low. This can reduce storage usage.

  • Privacy: personal or non-personal data. Separating personal data from the other data assists with GDPR and POPIA compliance as it helps secure the PII information, and easily forget a party if required.

  • Security: for example top-secret, public data, internal employees only. Different levels of security can be applied to the attributes, ensuring better governance around who can have access to the different types of data.

Non-Historized Data

Satellites can be classified as containing non-historized data. Non-historized data does not change. It is therefore unnecessary to check the HDIFF to see if the record has changed before loading it, instead it should be loaded regardless. This improves the load performance as unnecessary processing is avoided.

System-generated Satellites

Status Tracking Satellite

Status Tracking Satellites are applied to all satellite types (those associated to a Hub and those associated to a Link). The purpose of the satellite is to track if an instance of the attributes were Inserted (I), Updated (U), or Deleted (D). Deleted instances ensure that the values are not included in downstream calculations or reporting.

Effectivity Satellite

The Effectivity Satellite applies to Links only, and tracks the time period that an instance of a relationship was active. It is used to determine if a current relationship is active.

Business Satellites

Business Satellites (BSAT) describe an entity and can contain business logic to transform the data into meaningful information. The entity definition that was defined in the DataJoinery Conceptual Model automatically defines a BSAT per entity. The BSAT entity includes the fields that were defined as important attributes about the entity.

Point-in-time Tables

Point-in-time (PIT) tables take snapshots of the current version of the satellite data at a point-in-time. The tables store the Hash Key (HK), and the latest Load Datetime (LDTS) of the current version of each associated satellite (including the status tracking satellites). These speed up performance significantly, without having to replicate the raw data.

Bridge Tables

Also referred to as Summaries in DataJoinery, Bridge (BRG) tables store the current Hub, Link and other user-defined metrics at a point-in-time. This differ from PIT tables because they store the combination of any number of Hubs, and Links, whilst PIT tables are per Hub.

Bridge tables form the basis of Facts.

Data Vault Physical Entity Architecture

DataJoinery Data Vault Concepts

Keys

Business Keys

A business key is one or more ‘natural’ keys that uniquely identifies the instance of an entity. For a bank account, it could be a bank account number. For an Order, it could be an Order ID.

Hash Keys

The encryption of the combination of the business keys using a cryptographic hash function such as MD5 or SHA1 to produce a unique key.

Business Key Collision Code (BKCC)

DataJoinery includes a business key collision code in the Hash Key calculation to easily avoid the collision of two business keys from different systems. This is defaulted to the string ‘default’ and can be overridden when using the mapping wizards, or when editing the logical entity.

Driving Keys

In a link, a Driving Key is used to determine the constant Hub’s Hash Key in the relationship. As relationships can change over time, the effectivity of a combination of Hubs is determined using the Driving Key.

Driving Key Example

Using the example of a customer and their location, a Link can store the relationship between a Customer hub and Location hub. If a customer moves to another location, the relationship instance of the Customer and the old Location needs to be set as no longer valid, and the new relationship between the Customer and the new Location is now the active combination. As the Customer remains constant, the Customer’s HK will be set as the Driving Key.

Dependent-Child Link Hash Keys

As the dependent-child link (DCL) differs from a normal Hub and Link, the DCL has a HK as well as a DCL_HK which is made up of its business keys, as well as the business keys of the Parent entity and the Parent entity’s related entities (that are not a many-to-many relationship). For example, if an Order is linked to a Customer, an Order Item DCL_HK must include its Order Item ID, Order ID, and Customer ID as the Business Key calculation. The Order_Item_HK will be the Business Keys defined on the Conceptual Entity and will serve as the driving key.

Determining the type of Relationship using the Conceptual Model

People, Places, Things - Hubs

A conceptual entity that represents a standard Hub, such as a person, place, or thing, should be defined as having a Hub Conceptual Entity Type. This will create a _H object in the Enterprise Data Model (data vault model).

Events and Processes - Hubs and Links

A conceptual entity that is defined by instances of specific events but exist as an instance of an entity as well, such as orders, invoices, events; should be classified as a Hub & Link Conceptual Entity Type. The will create a _H object and a _L object in the Enterprise Data Model (data vault model).

If the entity contains numerous relationships to other entities where the entity is on the right-side of the one-to-one, one-to-many, zero-to-many relationship (i.e. the entity contains the foreign keys to other entities), one _L entity for the Link will be create and will contain the HK’s of the other related entities (excluding any many-to-many relationship types). Hierarchical, Same-as, or Derived (Exploratory) relationships will not be included on the _L entity but will exist in separate entities.

Dependent-Child Links

If a conceptual entity cannot exist without an instance of its parent entity, it should be classified as a Dependent Child Link Conceptual Entity Type. This will create a _DCL object in the Enterprise Data Model (data vault model).

The DCL will include its own HK, as well as the HK of its parent and the HK’s of the entities the parent is related to (excluding any many-to-many relationship types). Hierarchical, Same-as, or Derived (Exploratory) relationships will also not be included on the _DCL entity but will exist in separate entities.

Link Naming Convention

Standard Links

A link between two entities (entity A and entity b) will have the name entity_a_entity_b_l. The _L suffix can be overridden in the project settings.

Hub & Link Links

Entities that are defined as Hub & Link, such as and entity call Entity A, produce links with the name entity_a_l.

Dependent-Child Links

Dependent-child link for entity A creates an object with the name entity_a_dcl.

Default, Derived, Same-as, Hierarchical Relationship Types

Default Relationships

If a relationship is classified as a default relationship, the following logic is applied.

Link between a Hub and a Hub & Link Hub

If the entity on the right-hand side of the relationship is a Hub & Link, the entity can contain more than on Hub. The LINK Entity will contain the HK of itself (the Hub of the Hub & Link entity), as well as the HK’s of the entity on the left-side of the relationship for all relationships where:

    1. the Hub & Link entity is on the right-side
    2. where the relationship direction is not many-to-many
    3. where the relationship type is Default

 

The Hash Key that uniquely identifies the Link will be called L_HK and will include a combination of business keys and bcc’s of all of the related Hubs.

Link between two different Hubs

A LINK entity will be created that is includes both the left-side entity and the right-side entities name in the name of the link. I.e. a relationship between Entity A and Entity B will be entity_a_entity_b_l. the Link’s HK will be entity_a_entity_b_l_hk and will include a combination of business keys and bcc’s of all of the related Hubs. The name for entity_a_entity_b can be overridden by providing a name for the relationship.

Link between two Hubs that are the same

If a Default relationship is defined where the left and right entity are the same, a Hierarchical link entity type will be defined. The hierarchical link entity will be created that is includes the entities name, i.e. entity_hl. The Link’s HK will be entity_hl_hk and will include a combination of business keys and bcc’s for each Hub instances.

The name for entity can be overridden by providing a name for the relationship.

Link between a Hub and a Dependent-Child Link

If a Dependent-Child Link is related to a Hub that is not the parent (on either the left hand-side or right-hand side of the relationship), the Link entity must include all of the HK’s on the DCL (its own and those on its parent), as well as the HK’s of the entity it is related to.

The name of the entity will be _L, as it is a normal link. The name for entity can be overridden by providing a name for the relationship.

Derived (Exploratory) Relationships

Derived relationships are based on the data and not on foreign key constraints. They appear in green on the Conceptual Model to indicate that they contain some logic.

Derived relationships can be defined under the Conceptual Relationships and have the Relationship Type set to Derived. The name of the relationship can be overridden to ensure a more descriptive name is used.

Link between two or more Entities (either Hub, Hub & Link, or DCL)

If a Derived link is defined between two entities, the relationship will be an _EL and will include all of the HK’s on the Hub, Link, or DCL (its own and those on its parent), as well as the HK’s of the entity it is related to. All relationships that have a Derived Relationship Type and the same relationship name must be combined into one entity.

The name of the entity will be _EL, and can be overridden by providing a name for the relationship.

Same-as Relationships

Link between two Entities (same or different)

Same-as relationships are used to deduplicate data. If a relationship is classified as Same-as, a _SAL entity must be created. The naming will use the combination of the entities on the left- and right-hand side, and can be overridden.

The link entity must include all of the HK’s on the Hub, Link, or DCL (its own and those on its parent), as well as the HK’s of the entity it is related to.

Hierarchical Relationships

Link between two Entities (same or different)

A relationship that is classified as Hierarchical will follow the same logic as the Default relationship type for two fo the same entities. The hierarchical link entity will be created that is includes the entities name, i.e. entity_hl. The Link’s HK will be entity_hl_hk and will include a combination of business keys and bcc’s for each Hub instances.

The name for entity can be overridden by providing a name for the relationship.