Modern Data Modeling

10 Rules for Effective Dimensional Modeling for Data in the Cloud

Richie Bachala
14 min read · Mar 22, 2023

In this blog post, I’ll share the modern data modeling concepts that I’ve implemented as standards across my former team of data engineers and modelers at a Fortune 200 enterprise. Specifically, I’ll delve into the roles of data modeler and data engineer in designing a physical database, while addressing the two dominant approaches of data-driven and function-driven modeling. I’ll also explore alternative data modeling approaches that leverage the virtually infinite computing power of the cloud, highlighting the costs and benefits of each. Finally, I’ll discuss the continued relevance of data modeling in the context of cloud computing, and where it offers the most value to businesses building data warehouses.

Design Principles

These principles are our attempt to guide an organization toward effective data management and utilization. To achieve this, they call for maximizing enterprise-wide benefits, valuing data as an enterprise asset, collaborating across business units and IT organizations, and addressing technical debt. They also emphasize reusing existing solutions, innovating, and integrating consistently to enhance efficiency, reduce effort, and improve decision-making.

Image by Author.

Business Data Model Foundation

The Business Data Model Foundation is a non-technical, conceptual model that helps organizations build a business strategy based on their data and model future scenarios with reliable analysis. It involves gathering leadership support for strategic goals and includes several components, such as the Business Dimensional Model (BDM), which encompasses business terms, logical data structures, and their relationships. Additionally, it incorporates the Enterprise Dimensional Model, which provides a high-level overview of the organization’s data landscape, and the Bus Matrix, which outlines the integration points between business processes and data entities.

When it comes to building a business strategy, the analytics that support decision-making are entirely data-driven. The process aligns people, work, and competencies with the business strategy and objectives. Considering organizational design alongside workforce planning keeps the process future-ready.

To support a strategic goal, leadership backing is gathered to create a tactical campaign that uses multiple tactics to reach the right audience and gathers feedback to adjust the message.

Once the organization accepts and approves these strategies, an Enterprise Dimensional Model is architected to consolidate all fact and dimension tables. It is patterned after the Kimball Bus Matrix, provides a unified entity-level view of fact and dimension tables, and establishes modeling standards, including naming conventions.

In the Bus Matrix, each row corresponds to a business process or event within the organization, while each column corresponds to a common dimension of the business. The matrix delivers a big-picture perspective regardless of the database technology while breaking the work into reasonably sized, manageable development efforts.

The bus architecture allows for incremental data warehouse and business intelligence development by decomposing the planning process along business processes and conformed dimensions. A top-down strategic perspective ensures that data in the DW/BI environment can be integrated across the enterprise, while agile, bottom-up delivery occurs one business process at a time.

Dimensional Modeling

The dimensional model is a centralized data model designed to enable and support the highest levels of accuracy and quality for analytics.

Benefits of Dimensional Modeling

The methodology is widely recognized as a best practice for organizing data in the BI layer, with a focus on query performance, optimization, and extensibility. By using denormalized dimension hierarchies and star schema design, it eliminates the complex web of joins commonly found in normalized modeling. The database engine can then optimize queries by constraining the dimension tables and gathering the relevant keys to filter the fact tables. This allows for efficient processing of large datasets and provides a scalable solution for future growth. Another advantage is that the star schema design is more resilient to changes, such as adding columns with ALTER TABLE, since existing logic and data do not need to be reloaded.
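
As a minimal sketch of that pattern (table and column names such as f_sales, d_date, and d_product are illustrative, not from a real schema), a star-schema query constrains the dimensions first and lets the engine use the gathered surrogate keys to filter the fact table:

-- Constrain the dimension tables, then filter the fact table through their keys.
SELECT d.calendar_month,
       p.product_category,
       SUM(f.sales_dollars) AS total_sales
FROM   f_sales f
JOIN   d_date d    ON f.date_key    = d.date_key
JOIN   d_product p ON f.product_key = p.product_key
WHERE  d.fiscal_year = 2023
  AND  p.product_category = 'Paint'
GROUP BY d.calendar_month, p.product_category;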

Dimensional Model Definitions

Dimension Tables

Every dimension table is a flat, denormalized set of attributes describing the business activity, typically with many low-cardinality text columns. Operational codes are supplemented with verbose descriptions of domain values. Dimensions are the primary targets of the constraints and grouping specifications issued by queries and BI applications.

Dimension Table Keys

Every dimension table has a single, unique primary key column that is embedded as a foreign key in associated fact tables. Dimension primary keys should be sequences, with key 0 reserved for the “N/A” record used for non-applicable values or late-arriving dimension stubs.

Natural Keys

Natural Keys are operational source system keys that are durable identifiers.

Surrogate Keys

Surrogate keys are artificial keys that substitute for natural keys, usually as sequenced numbers. The first key, 0, is typically reserved for the “N/A” record of the non-applicable or late-arriving dimension stub.
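
A minimal sketch of this convention, assuming a hypothetical d_customer dimension (all names illustrative):

-- Surrogate key as the primary key; key 0 reserved for the "N/A" stub row.
CREATE TABLE d_customer (
    customer_key  INTEGER      NOT NULL,  -- surrogate key from a sequence
    customer_id   VARCHAR(50),            -- durable natural key from the source system
    customer_name VARCHAR(200),
    customer_type VARCHAR(50),
    PRIMARY KEY (customer_key)
);

-- Row 0: the non-applicable / late-arriving dimension stub.
INSERT INTO d_customer VALUES (0, 'N/A', 'Not Applicable', 'N/A');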

Fact Tables

Fact tables contain performance measurements generated by business activities or events, reflecting transactional quantities or amounts. Each measure is a numeric value captured at the time of the measurement event, not a descriptive attribute, and can be aggregated. Fact tables are large, often with millions or billions of rows, but they are efficiently organized.

Fact Table Keys

A fact table’s primary key is a multipart key made up of the foreign keys of the intersecting dimension tables involved in the business process. Every foreign key in a fact table must match a unique primary key in the corresponding dimension table.

Fact Table Granularity

The fact grain is the business definition of the measurement event that produces a fact row. A fact row is created when an event occurs, and the single uniform grain determines the fact table’s primary key. Fact tables should contain the lowest, most detailed atomic grain captured by the business process. Every fact table has at least one foreign key to an associated date dimension table, such that the grain includes a single calendar day.
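
For illustration only (the table, keys, and measures are assumptions, not a prescribed design), a transaction fact table at the order-line grain could look like this, with one foreign key per dimension and a date key tying each row to a single calendar day:

CREATE TABLE f_sales (
    date_key      INTEGER NOT NULL,   -- foreign key to d_date (one calendar day)
    product_key   INTEGER NOT NULL,   -- foreign key to d_product
    customer_key  INTEGER NOT NULL,   -- foreign key to d_customer
    order_number  VARCHAR(30),        -- transaction identifier kept on the fact row
    ordered_qty   NUMERIC(12,2),      -- additive measures true to the declared grain
    sales_dollars NUMERIC(14,2)
);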

Conformed Dimensions

Conformed dimensions are standardized dimensions used to integrate data across the enterprise. They are reused consistently across multiple fact tables and deliver the same content and interpretation regardless of the business process involved. A conformed dimension has the same keys, attribute names, attribute definitions, and domain values regardless of the fact table it joins to. Typical conformed dimensions are at the lowest atomic granularity. Shrunken conformed dimensions support fact tables whose measurements are captured at a higher, less granular, rolled-up level for performance. Using these dimensions is a business policy decision that development teams follow.

Date & Time Dimensions

The Date and Time dimensions are special dimensions: every fact table is a time series of observations at a daily or more precise grain, and these dimensions provide many useful attributes for navigating, filtering, and describing calendars.

The Date Dimension is a list of calendar attributes, from a date key to all variants of date types, descriptions, and indicator types (e.g., date, day of week, month, calendar and fiscal month and year, holiday indicator), all loaded once. Relative day, relative week, relative month, relative year, and similar attributes are refreshed at the beginning of each day, since their values change with respect to the current date. An intelligent surrogate date key of the form YYYYMMDD is recommended as the primary key of the date dimension for a variety of reasons, but it should not be used to bypass the date dimension.

The time of day should be treated as a dimension only if there is a meaningful textual description for periods within the day, such as work shifts or hourly events; otherwise it can be expressed as a non-additive date/timestamp fact.

In addition to the date dimension and an optional time-of-day dimension, we store the full date/timestamp in the fact table to support precise time-interval calculations across fact rows. Business needs for lead or lag times should be handled as calculations performed in ETL. For businesses spanning multiple time zones, the date and time of day may need to be expressed both in local time and in global standard UTC time. And last but not least, calendar navigation should be driven through the date dimension, not through hard-coded application logic.
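
As a hedged sketch of that last point, assuming the d_date table carries the relative-day attribute refreshed each morning, calendar navigation goes through the dimension rather than hard-coded date arithmetic:

-- Trailing 30 days via the date dimension; relative_day = 0 is assumed to mean "today".
SELECT d.fiscal_month,
       SUM(f.sales_dollars) AS sales_dollars
FROM   f_sales f
JOIN   d_date d ON f.date_key = d.date_key
WHERE  d.relative_day BETWEEN -30 AND 0
GROUP BY d.fiscal_month;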

Degenerate Dimensions

Degenerate dimensions are transaction identifiers, natural keys stored in the fact table without additional attributes (e.g., order number, invoice number, transactional timestamp). They can serve as parent header keys in the design of parent-child transaction systems, they live in the fact table because they are unique parent transactional natural keys, and their cardinality is so high that a separate dimension table would not benefit performance.

Slowly Changing Dimensions

Slowly changing dimensions change infrequently but need to track time-variant attributes as required by the business. Typical examples are entities such as geographic locations, customers, or products.

Type 1 SCD

The old value has no business significance, so it is simply overwritten with the current value in the data row. Historically accurate associations are lost.

Type 2 SCD

The new value has business significance, so changing values must be tracked. This is the most popular approach: changes in attribute values are tracked alongside the natural key using row effective date, row expiration date, and current row indicator columns. It requires updating the prior row, setting its row expiration date to the current date and its current row indicator to No, and inserting a new row with a new surrogate primary key, a row effective date of the current date, a current row indicator of Yes, and the new attribute values.
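
A minimal sketch of that two-step change, assuming the d_customer dimension has been extended with the tracking columns above (keys and values are illustrative; the new surrogate key would normally come from a sequence):

-- Step 1: expire the current row for the changed natural key.
UPDATE d_customer
SET    row_expiration_date   = CURRENT_DATE,
       current_row_indicator = 'N'
WHERE  customer_id = '12345'
  AND  current_row_indicator = 'Y';

-- Step 2: insert a new row with a new surrogate key and the new attribute values.
INSERT INTO d_customer
    (customer_key, customer_id, customer_name, customer_type,
     row_effective_date, row_expiration_date, current_row_indicator)
VALUES
    (98765, '12345', 'Acme Corp', 'Wholesale',
     CURRENT_DATE, DATE '9999-12-31', 'Y');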

Type 3 SCD

The new value represents a “soft” change with business significance: for some analyses the business wants to act as if the change had not occurred. It is used infrequently. A new column is added to the dimension table; typically the old attribute value is pushed into a “prior” attribute column and the new value overwrites the existing column.

Mini-Dimensions

When Type 2 SCD dimensions grow too large (millions of rows), frequently changing attributes can be broken out into their own separate mini-dimensions using banded ranges and restricted growth.

Banded ranges clump values together into a new dimension. The approach is limited by the specificity of the data, since changing to a different set of value bands at a later time is difficult.

Restricted growth keeps the mini-dimension from becoming too large (fewer than 200K rows); create another mini-dimension when it exceeds the 200K-row limit.

Role-Playing Dimensions

Role-playing dimensions occur when the same physical dimension plays distinct logical roles. For example, multiple date values may be associated with each fact row (e.g., an Order Date dimension, an Invoice Date dimension, and a Shipped Date dimension). Similarly, a self-joined dimension can role-play: a person dimension can play the organizational roles of employee, manager, director, and so on.
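
A small sketch of the date example, assuming the fact table carries separate order and ship date keys (illustrative names), where a single physical d_date table plays both roles through aliases or views:

-- One physical d_date, two logical roles.
SELECT od.calendar_date AS order_date,
       sd.calendar_date AS ship_date,
       f.sales_dollars
FROM   f_sales f
JOIN   d_date od ON f.order_date_key = od.date_key
JOIN   d_date sd ON f.ship_date_key  = sd.date_key;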

Junk Dimensions

Junk dimensions hold miscellaneous attributes that do not fit an appropriate existing dimension. Place such attributes into one or more junk dimensions to balance the dimension-to-fact table size ratio, categorizing similar attributes (such as comments) together into a single junk dimension.

Outriggers

Outriggers occur where a dimension is joined to another dimension. An acceptable snowflake situation can exist in certain cases where a dimension is joined to another dimension at a significantly different grain than the primary dimension. However, a dimensional model should not be littered with outriggers, since the additional joins will impact query performance.

Bridge Tables

Bridge tables are used when there is a one-to-many relationship between a fact row and dimension rows, as in hierarchies and many-valued dimensions.

A bridge table containing the parent key, bridge surrogate key, level from parent, bottom indicator, and top indicator of the hierarchy allows for hierarchy navigation and rollup. The fact row’s foreign key points to the bridge’s primary surrogate key, and the bridge’s parent key points to the dimension table’s primary surrogate key. Placing filters on the dimension and/or bridge tables allows rollup or drilling of fact table data.

A bridge table containing a primary group key, an individual key, and a weight factor is used for accurate allocated totals. The fact row’s foreign key points to the bridge’s primary group key, and the bridge’s individual key points to the dimension table’s primary surrogate key. Summing each fact measure multiplied by the bridge weight factor gives the correct allocated total.
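
A hedged sketch of that allocation query, assuming hypothetical f_account_balance, b_account_customer (the bridge), and d_customer tables:

-- Allocate each account balance across customers using the bridge weight factor.
SELECT c.customer_name,
       SUM(f.balance_amount * b.weight_factor) AS allocated_balance
FROM   f_account_balance f
JOIN   b_account_customer b ON f.account_group_key = b.account_group_key
JOIN   d_customer c         ON b.customer_key      = c.customer_key
GROUP BY c.customer_name;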

Three Fundamental Grains

Transactions give a highly detailed view, periodic snapshots allow quick monitoring of overall performance, and accumulating snapshots show results for processes with defined beginnings and endings.

  1. Transaction Fact Tables have a grain at the lowest level: one row per line of a transaction. The transaction grain is a point in space and time, and measurements at the transaction grain must be true at that moment.
  2. Periodic Snapshot Fact Tables capture the performance of a business process spanning a well-defined periodic interval of time. At the predetermined interval, snapshots are taken at the same level of detail and stacked consecutively in the fact table. Periodic snapshots complement, but do not replace, the detailed transaction grain, and they have far fewer dimensions.
  3. Accumulating Snapshot Fact Tables represent activity over an indeterminate time span for processes that have well-defined beginnings and endings. They contain multiple date foreign keys, each corresponding to a major milestone in the workflow (see the sketch after this list). Unlike the transaction grain, many of the date values are unknown when the row is first loaded; updates occur as milestone occurrences are captured.
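
As an illustrative sketch only (the workflow, milestones, and names are assumptions), an accumulating snapshot for an order-fulfillment process carries one date key per milestone and is updated in place as milestones occur:

CREATE TABLE f_order_fulfillment (
    order_number       VARCHAR(30),       -- transaction identifier
    order_date_key     INTEGER NOT NULL,  -- milestone 1: order placed
    shipped_date_key   INTEGER,           -- milestone 2: unknown until shipment
    delivered_date_key INTEGER,           -- milestone 3: unknown until delivery
    ordered_qty        NUMERIC(12,2),
    shipped_qty        NUMERIC(12,2)
);
-- Unknown milestone keys can point to the 0 / "N/A" row of d_date and are
-- updated in place as each milestone is captured.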

Facts Allocation

Business stewardship must be brought in to determine the weight factors for allocation rules. The business can often reach agreement on allocation rules, in which case the measurements can live in a single fact table. When the business cannot reach agreement on an allocation, or prohibits it, a separate fact table has to be designed to accommodate the measurements.

Factless Fact Tables

A factless fact table captures a business process whose events occur without generating quantifiable measurements. These events are important to the business and track dimension relationships via foreign keys, but carry no numeric measures.

Consolidated Fact Tables

A consolidated fact table merges measurements from multiple business processes into a single fact table at a single, common level of granularity.

Designing Process of Dimensional Model

The design process of the Dimensional Model has four iterative process steps:

  1. Choose the Business Process — a business prioritization activity listed on the Bus Matrix.
  2. Declare the Grain — the level of detail in the fact table for the selected business process.
  • The grain declaration says exactly what a fact table measurement represents in business terms.
  • Declaring the grain is the most critical part of the design process.
  • The atomic grain is the lowest granularity a process can have.

3. Identify Dimensions

  • Determine dimensions and attributes applicable to the fact table at the stated level of granularity.
  • All dimensions of the Bus Matrix should be tested against the grain.

4. Identify the Facts

  • Identify measures from the business process.
  • Fact table grain establishes the foundation for determining the appropriate dimensions.
  • The facts must be true to the grain.

Data Model Review and Validation

Reviewing the model with successive audiences at different levels of business and technical expertise, and gathering their feedback, provides a more successful implementation path. Preparation should include some dimensional modeling education as part of the review process. Opening each session with a review of basic dimensional modeling concepts (analytics vs. transactional) clarifies the path to success. The presentation to business users should illustrate how dimensional modeling supports their business requirements, starting with the Bus Matrix and showing how the model answers business questions. All of this further helps develop the steps needed for success.

Finalize Design Documentation

Your design documentation is finished when you have exhausted the issue list or the project team, whichever comes first. :) The completed documentation includes:

  • Brief description of the business process(es) including the design.
  • High-level discussion of business requirements supporting a detailed requirements document.
  • High-level data model diagram.
  • Detailed dimensional design for each fact and dimension.
  • Open issues list highlighting unresolved issues.
  • Discussion of known limitations of the design with respect to project scope and business requirements.
  • Other items like design compromises or source data concerns.

Implementing Dimensional Architecture

Identifying Data Sources

Identify all raw source data, on premises or in the cloud, as raw tables, enforcing a very thin source layer. Renaming fields to user-friendly names helps with debugging. Casting columns to appropriate data types helps with data conformance. Use only minimal transformations that are 100% guaranteed to be useful for the foreseeable future. Placement in a logically named schema makes the data easier to understand. The source model should never remove any data, join to other tables, or transform the meaning of a column. Most important, the source model should be considered the “raw” data.

Sensitive Data

There are cases where raw values should not be broadly exposed but must remain available to those with a legitimate need to know. This security can be established using Row-level Security, Column-level Security, and Data Masking Policies.

Row-level Security supports context functions and role hierarchy in the filter conditions.

Column-level Security supports context functions and role hierarchy in the conditions of the Masking Policy body to enforce user authorization.

Data Masking Policies mask data using full-mask, hash, and encrypt/decrypt functions.
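
Since the post references Snowflake, here is a hedged sketch of a column-level masking policy; the policy name, role, table, and column are assumptions for illustration:

-- Reveal the raw value only to an authorized role; mask it for everyone else.
CREATE OR REPLACE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
    CASE
        WHEN CURRENT_ROLE() IN ('PII_ANALYST') THEN val
        ELSE '*** MASKED ***'
    END;

ALTER TABLE d_customer MODIFY COLUMN email SET MASKING POLICY email_mask;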

Staging

Staging is the business-specific transformation layer tailored to analytical needs. Examples include filtering irrelevant records, streamlining columns required for analytics, renaming columns to represent abstract business concepts, joining to other tables, executing business logic, and modeling fact and dimension tables following the Kimball methodology.

Naming Standards

It is critical for a self-service data environment to establish naming standards. Always use full words in model names where possible. Here are some of the Enterprise Data Warehouse naming standards:

Fact Tables: F_<verb> — Real-world process identified by transactional action (ex: ordered_qty, shipped_qty, sales_dollars, etc).

Dimension Tables: D_<noun> — Descriptive attribute context to the fact records because they are “nouns”. (ex: date, place, customer, product, etc).

Bridge Tables: B_<verb>_<noun> — Bridges the many-to-many relationships between a Fact and Dimension.

Singular naming should be used (ex: customer, product) where dimension attributes act as “adjectives” (ex: customer type, product description).

Prefixing table names — to group like data when sorted alphabetically (ex: d_geo_location, d_geo_region).

Primary and Foreign Key Constraints are not enforced in Snowflake data warehouses!
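
To make the naming standards concrete, here is a hedged sketch of declaring keys on illustratively named tables; in Snowflake these constraints are informational only and are not enforced:

-- Documentation-only constraints: Snowflake records but does not enforce PK/FK.
ALTER TABLE d_product ADD CONSTRAINT pk_d_product PRIMARY KEY (product_key);
ALTER TABLE f_sales   ADD CONSTRAINT fk_f_sales_product
    FOREIGN KEY (product_key) REFERENCES d_product (product_key);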

THE 10 RULES

Rule #1 — Load detailed atomic data into dimensional structures so it can support detail as well as summary reporting.

Rule #2 — Structure dimensional models around business processes that capture or generate performance metrics associated with each event.

Rule #3 — Ensure that every fact table has at least one foreign key to an associated date dimension table, such that the grain includes a single calendar day.

Rule #4 — Ensure that all facts in a single fact table are at the same grain or level of detail regardless of grain being transactional, periodic snapshot or accumulating snapshot.

Rule #5 — Resolve many-to-many relationships in fact tables by using a bridge table in conjunction with the fact table.

Rule #6 — Resolve many-to-many relationships in dimension tables by denormalizing data.

Rule #7 — Store report labels and filter domain values in dimension tables as dimension attributes without nulls.

Rule #8 — Make certain that dimension tables use a surrogate key especially if you are tracking dimension attribute changes with a new dimension record for each profile change.

Rule #9 — Create conformed dimensions to integrate data across the enterprise with consistent reusability across multiple fact tables.

Rule #10 — Continuously balance requirements and realities to deliver a DW/BI solution accepted by the business users and that supports their decision-making.

SUMMARY

This blog post presents a valuable and enduring introduction to data modeling, specifically using the Kimball methodology. It advises beginner data engineers to begin with a single data source for dimensional modeling due to the complexity of the topic, and emphasizes that mastering data modeling requires a journey of training and practice to achieve successful implementations. The post also highlights the importance of multiple iterations in the design process.

The content was derived from a comprehensive spreadsheet that we developed over several years; this blog is my attempt to summarize much of it in a readable, bloggable format.

If you have any suggestions or comments, please feel free to reach out to me on Twitter (@richiebachala). I would love to hear from you and continue the conversation about data modeling best practices. Thank you again, and happy modeling!

Richie Bachala

Distributed SQL, Data Engineering Leader @ Yugabyte | past @ Sherwin-Williams, Hitachi, Oracle