Data Modelling Workout 03 - Denormalizing and Fixing Dimensions

BrianJ · April 25, 2023, 3:48am

Difficulty Rating: 2 out of 5

The phone rings. It’s a call from a potential new client (an office supplies store) who says that their Power BI report isn’t working and asks if you can help. You tell them to send over their data model populated with some sample data and you’ll take a look.

As soon as you look at the model, you see three big problems:

The first problem is the Products table. They’ve got it set up as a snowflake structure, which while not technically wrong is going to cause problems with the slicers syncing properly, is likely to decrease performance of the report, and make the DAX more complicated. You definitely are going to de-normalize this down to a single Products table incorporating category, subcategory and product.
The second problem is the Customer table. This should have a one-to-many relationship with the Orders table, not a many-to-many. This problem is caused by the developer misinterpreting the address information. This is not the customer address but is the destination shipping address. Thus, you need to remove this information from Customer table and move place the city field in the Destinations table. Once that’s done, fix the relationship between the Customer table and the Orders table.
Finally, the Destinations table currently has no relationship to the fact table. You could connect the tables with the Postal Code field, but you’d rather use an integer key, and if you change the postal code to an integer you will lose the leading zero on some of the ZIP Codes. So, instead create a new numeric Destination ID key field (1001, 1002, 1003…) linking the Destinations and Orders tables.

You call the client back and explain what you found. They agree to your hourly rate and ask you to go ahead and fix the report.

To document that you’ve got it working correctly, post a screenshot of your revised data model and a screenshot of the report page with each of the slicers activated (the blank one gets the moved City field). Good luck!

I will post my solution, Sunday April 30.

Data Modelling Workout 03 - Denormalize and Fix Dimensions.pbix (826.6 KB)
Data Modeling 003.xlsx (1.4 MB)

Ondrej · April 28, 2023, 8:52am

My attempt

My Solution

Salil · April 28, 2023, 8:44pm

@Ondrej looks awesome. Filters flowing downhill, Brian Julius, Rob Collie and Matt Allington will be very happy. Spend extra time to get Start Schema right then let the Data Model and DAX dance to your tunes!!

jtej001 · April 28, 2023, 10:51pm

Hi, Think I got the solution working
Please review
Data Modelling Workout 03 - Denormalize and Fix Dimensions.pbix (827.6 KB)

AFTER (1)

BrianJ · April 28, 2023, 10:59pm

@Ondrej ,

Bingo! Well done – thanks so much for your participation here and in the other workouts!

– Brian

BrianJ · April 29, 2023, 12:59am

@jtej001 ,

Spot on! Nailed all three of the fixes.

Also, liked your “before and after” gif. Thanks for participating!

Hope you found it useful.

Brian

SamMcKay · April 29, 2023, 2:08am

Love the animation. Well done

keith.f · April 29, 2023, 10:19pm

Hakeem · April 30, 2023, 12:38am

I have a really good feeling that these data modeling challenges will help me be very impactful to my team. Most of our issues lies within our data model. The only thing is that we do not have anyone who is super advanced at data modeling. I can’t wait to see what’s in store for these upcoming challenges.

This was a nice challenge. Here is my submission:

BrianJ · April 30, 2023, 5:06am

@keith.f ,

Bang on - well done! Data model is now a shiny star schema, ready to roll.

Thanks for participating!

Brian

BrianJ · April 30, 2023, 5:07am

@Hakeem,

As usual, spot on! Great to hear you’re finding these helpful.

Thanks for participating, and for your great LinkedIn post on the data modeling workouts earlier tonight.

Keep up the great work!

Brian

TomiwaB · April 30, 2023, 11:03am

I hope I am not too late. Please see my attempt below:

Schema

Report Page

Thank you!

JarrettM · April 30, 2023, 3:30pm

Here is my solution:

My Solution

Thanks
Jarrett

jany.gibson · April 30, 2023, 6:59pm

Here is my solution, I hope I did it correctly.
I always make ID fields as text; please advise why would we want an integer Primary Key ID. Thanks!

Data Modelling Workout 03 - Denormalize and Fix Dimensions Franziola.pbix (952.6 KB)
Data Modeling Workout 3-Denormalize and fix Dimensions.pdf (336.7 KB)

BrianJ · April 30, 2023, 10:39pm

This workout covers a lot of common issues/problems in getting a good star schema data model set up.

PRODUCTS TABLE
One of the biggest takeaways I think is regarding collapsing the snowflake dimensions into a denormalized dimension for the Products table.

Procedurally, the conversion is straightforward - just merge the Category and Subcategory tables with the Products table to bring over the Category and Subcategory fields. Then unload the Subcategory and Category tables and move them to the “Staging Queries” group - no reason to keep them in the active data model once all the relevant info is captured in the Products table.

Note that while this will introduce some duplication into the Products table, due to the efficiency of the Power BI Vertipaq compression algorithm the “cost” of this in terms of increased file size is minimal.

We can look at the effects directly using DAX Studio, by going to Advanced then View Metrics options. We can see that for the snowflake (normalized) version of the Products table, the statistics on size are as follows:

Now for the denormalized Products table after the 2 merges, the statistics are:

We can see that the Products table has grown from 219.2K to 254.6K, BUT we no longer have to load the Category or Subcategory tables, so the total space associated with the entire Products dimension has actually DECREASED from 255.8K to 254.6K. And at the same time, we now have a simpler structure where the DAX is easier to write and all the Product-related slicer fields now synch perfectly.

TIP: when merging, always be sure to check that the number of matching elements is what you think it should be.

CUSTOMERS TABLE
Our main task here is splitting out the City field from the Address string and then moving that to the Destination table to resolve the initial many-to-many problem with the Customer table.

But there are a couple of potential pitfalls to avoid.

The first one involves splitting the Address field. When you have to split a column multiple times, the order of the splits often matter. In this case, you need to split by comma first to avoid potentially splitting the city name by space.

Now at this point, things can get a bit tricky. If you merge now with the Destination table to bring the City field over to the Destination table, when you delete the Address-related fields from the Customer table, you are going to get a dependency error in the Destination table.

The way I avoided this was to reference the Customer Table at this point and rename it “Customer Staging”. Then I can join Customer Staging and Destination to bring the City field over, unload Customer Staging and then delete all the prior Address fields from Customer table - solving the dependency problem…

DESTINATION TABLE
For the Destination Table we first ensure that we have no duplicates in the table, and then create the Index field, choosing Custom and starting at 1001 and incrementing by one.

We can then merge the Destination table with the Orders table to pull the numeric key into the Orders table.

Generally, integer keys are faster to join and index compared to text keys, as they require less storage, are more efficient in terms of computation, and are easier to maintain - potentially reducing referential integrity violations (i.e., keys present in the fact table, but missing in the dimension table.

However, it’s not a hard and fast rule, as text keys can be more informative and may be required to link with external systems, so you will need to evaluate the specifics of your particular use case before making a decision. But generally speaking, integer keys are preferred.

Once we do this we can close and apply and reconnect our data model using the proper fields.

Ooooh, look at that pretty star schema!

The types of problems you addressed in this workout you will see over and over again in the real world. So, I hope you found this helpful.

Thanks to all for participating, and @Greg and I will be back tomorrow with a new workout.

Data Modelling Workout 03 - Denormalize and Fix Dimensions - Brian Julius Solution.pbix (831.7 KB)

BrianJ · April 30, 2023, 10:44pm

@jany.gibson ,

For some reason, Power BI is not letting me open your PBIX file, by looking at your PDF file it all looks spot on!

Per your question about text versus integer keys, I have addressed that in my solution writeup. Thanks for participating – hope you found this helpful!

– Brian

BrianJ · April 30, 2023, 10:48pm

@JarrettM ,

The Products and Customer tables look good, but it seems you still have one more step to go (creating the sequential integer key to join Destinations table with the Orders table).

Brian

BrianJ · April 30, 2023, 10:50pm

@TomiwaB ,

No such thing as “too late”. These workouts are meant to be durable exercises you can do at any time.

Looks like you nailed all three problems spot on - nice work.

Thanks for participating!

jany.gibson · May 3, 2023, 3:45am

Thank you, very thorough and helpful, I appreciate the integer key explanation, I now have another perspective.

jany.gibson · May 3, 2023, 3:51am

Thanks Brian, I’ll check tomorrow if there was a sensitivity label applied to the .pbix file tomorrow and will download some more challenges as they are very instrumental.