When to create a dimension and not

Hi!

Are there any general recommendations for when to create a separate dimension table and when to simply keep the values in the fact table of as part of another dimension table?
Take the following example with the raw tables from a database (no transformation) of course the customer table could contain many more columns

Would it be better to keep this table structure or should you merge the Country and Gender values to the Customer table like this.

I guess that if I use the dimensions on other fact tables (other than sales) it makes sense to keep it as separate table but when (if ever) is it a good idea to keep the values in the dimension or fact table and not break it into its own table?

Thanks
Fredrik

1 Like

@Frede,

Your question is probably the most debated issue in data modeling - snowflake (your top configuration) vs. star schema (your bottom configuration). Here’s a good summary table of the pros and cons of each:

My personal view is to stick with star schema when at all possible. It makes for a simpler data model, and makes DAX easier to write and debug. But I don’t often work with really large datasets, where the denormalization in a star schema would be a bigger issue.

I hope this is helpful.

  • Brian

P.S. Here’s another recent thread on this issue with a link to an excellent Microsoft article on star and snowflake schemas:

Thanks Brian!

I guess there is no definitive right or wrong but the usual “it depends”. One advantage of keeping to a star schema IMO is that for end-users that want to create their own reports from scratch it is much more intuitive to look for the gender or country fields in the customer table, than to have to open the gender and country tables to find them and add to their visuals.

@Frede,

Totally agree. Snowflake schemas are definitely valid, and have their place, but I think in most cases star schema makes everyone’s life a lot easier.

I heard Marco Russo talk on this topic this summer and he said something to this effect: A star schema will work for almost every model. Many of you are probably sitting there thinking “yes, but my model is special”. I’m here to tell you today that your model is not special – use a star schema.

It got a big laugh and probably was a bit of an exaggeration to make a point, but I think it’s pretty good advice…

  • Brian
3 Likes

The star schema vs. snowflake schema question comes up pretty frequently, so I thought for future reference I would add this brief discussion to the thread, since I found it quite informative re: this issue:

  • Brian
1 Like

In terms of how the DAX engines are built and how the experts at the Kimball group suggest, always use a Star Schema. It’s the natural tendency to want to normalize the dimension tables. Just makes sense and feels “right” to have a Product Table then one with Product Category and the one with product subcategory. You would be better off leaving all of that in on Product table for a few reasons:

  1. You really arent saving that much space by breaking up the tables. The fact tables will generally be 20x larger (if not more) so any thoughts of saving space should go out the window

  2. If you normalize your data you will have many foreign keys in the original table. So if you had product table and broke it out to product category you would you have to have a Product Category foreign key in the Product table. Now since the cardinality (unique values) of the product category table values will generally be low (thus easily compressed by VertiPaq) there really isnt a cost to keeping the actual values vs. a foreign key.

  3. The more steps the dax engines has to do to transfer the filters from your dimension tables to your fact tables the potential for worse performance.

The only benefit that generally comes from snowflaking is it can be easier for the user to navigate, though I would contend that is more for actual pivot tables vs. slicers and such you can use in PBI.

So even if the dimension table is “bigger” when compared to the normalized version, doesnt really matter. what matters is the columns (i.e. the column size, data type, cardinality). When looking around the web be aware that you will come across alot of star vs. snowflake in terms of row based databases (i.e. sql). DAX and VertiPaq are column based, so things can and are different.

3 Likes

@Nick_M,

Thanks – terrific explanation. I’m sure this post will get referenced a lot in the future.

  • Brian

I agree with the above, whenever possible, a star schema is most logical.
However, I am going to add in an argument for Snowflake where appropriate.

Example, in my data set, I have a customer table, each customer has both Ship-To locations, and Departments. These are all dimension tables, my fact table tells me which customer ordered the product, for the specific Ship-To location and Department.

Sounds simple, right? :slight_smile:
Okay, but here’s the twist, a customer might have Ship-Tos A, B, C, & D - and Deparments 1, 2, & 3
Sales can be to any combination of those 4 Ship-Tos and 3 Departments, so if I normalize the customer to include every possible Ship-To and Department combo, for that one customer I will have 14 rows (12 for combos that include both Ship-To and Department, with another 2 for those that have only one of the two).
And, because this is a real-world scenario, yes, I have to go to the Ship-To/Department level, and I have additional detail columns for each.

In this case, with address columns for the Ship-Tos, Contacts (and additional info) for both Ship-To and Department, that really does bloat my customer table to be much larger than it needs to be.

And in this case, yes, I did it - I had 20 additional columns, and multiple rows for each customer, and it was a SLOOOooooW mess. So, I took a weekend, reconfigured to a snowflake, and things were much smoother.

So, yes, sometimes, it just ‘depends’ - try for a star schema whenever possible, but don’t close yourself off to the idea of a snowflake schema if things seem to be too slow.

2 Likes

Most definitely! I use them in my actual work as well. My main point is that if you dont have to, then you shouldnt. I’ve seen some models where things were overly normalized and that is no good for PBI. There’s too many variables out there to come up with hard and fast rule, just guidelines.