How to generate surrogate keys for SCD fast in Power BI

Martin · August 24, 2020, 1:59pm

Hi,
I have to deal with slowly changing dimensions in my data and need the history for my model, i.e. valid from and valid to dates in addition to each ID, e.g. PersonID. My approach is to create a surrogate key for each combination of PersonID and ValidFrom date in the Dimension table, add the correct surrogate key to each fact record in the fact table and create the one to many relationship in the data model between the surrogate key columns. So far nothing special and it works.
The issue is the loading time after adding the creation of the surrogate keys in Power Query. My first approach was creating a custom function to look up a surrogate key from the dimension table. It works and the nice thing about this soltion is that it needs no coding at all in Power Query and thus can be taught to anyone without coding knowledge. But loading of about 80,000 fact records took 168 seconds which means quite some waiting time.
Since I had no baseline to compare against I created two views on the Microsoft SQL-Server (on premises), one for the dimension table and one for the fact table, both including the surrogate keys. Now loading took only 11 seconds for doing the same things! There are no materialized transformations on the SQL-Server, which would speed up loading even more, so all the same things are done just in time as in Power Query before. The views essentially look like this and the key seems to be that the fact view does not only use “=” comparison as join criteria.

Dimension (People) including creation of surrogate key as a hash from PersonID and ValidFrom:

SELECT
CONVERT(varchar(40), HASHBYTES(‘MD2’, { FN CONCAT(CAST(ValidFrom AS nvarchar(32)), CAST(PersonID AS nvarchar(16))) }), 2) AS SurrogateKey,
PersonID,
FullName,
ValidFrom,
ValidTo
FROM WideWorldImporters.Application.People_Archive
WHERE (ValidFrom <> ValidTo)

Fact table (Invoice) with joined surrogate key

SELECT
oltp.InvoiceID,
bi.SurrogateKey AS SalespersonSurrogateKey,
oltp.SalespersonPersonID,
oltp.InvoiceDate
FROM WideWorldImporters.Sales.Invoices AS oltp
INNER JOIN WwiPowerBiViews.dbo.People AS bi
ON oltp.SalespersonPersonID = bi.PersonID
AND oltp.InvoiceDate >= bi.ValidFrom
AND (oltp.InvoiceDate < bi.ValidTo
OR NULL = bi.ValidTo)

In order to optimize the Power Query solution I tried to maximize query folding and chose a design as close as possible to the views. I managed to completely fold the dimension table query to the SQL-Server, but for the fact table I used an inline function to build the join and this breaks query folding. So in the end the solution was the slowest with a mindblowing 426 seconds.
Is there a way to build a faster solution in Power BI instead of SQL-Server views, e.g. by better use of query folding? Is there a way to express the SQL JOIN I used in the view in Power Query so it folds to the SQL-Server? What would be the fastest approach to add these surrogate keys in Power BI?
I have attached all three Power BI models for your reference, and the source data is the WideWorldImporters database available from Microsoft.
PBIX w/ custom function: https://drive.google.com/file/d/1XxIqA6YiJa3JIbEi3ZHx3HZnMSpGmKoe/view?usp=sharing
PBIX w/ views: https://drive.google.com/file/d/1rfLpNWuICuMTnIh1MaCtUW1cflU676QQ/view?usp=sharing
PBIX w/ inline function: https://drive.google.com/file/d/1KG2zWIGBhbdyxIhXNQVxZXJuw-Es3RO6/view?usp=sharing
Database: https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0
BR
Martin

EnterpriseDNA · August 25, 2020, 2:01am

Hi @Martin, we aim to consistently improve the topics being posted on the forum to help you in getting a strong solution faster. While waiting for a response, here are some tips so you can get the most out of the forum and other Enterprise DNA resources.

Use the forum search to discover if your query has been asked before by another member.
When posting a topic with formula make sure that it is correctly formatted to preformated text </>.
Use the proper category that best describes your topic
Provide as much context to a question as possible.
Include demo pbix file, images of the entire scenario you are dealing with, screenshot of the data model, details of how you want to visualize a result, and any other supporting links and details.

I also suggest that you check the forum guideline https://forum.enterprisedna.co/t/how-to-use-the-enterprise-dna-support-forum/3951. Not adhering to it may sometimes cause delay in getting an answer.

adesinamk · August 28, 2020, 11:21am

Hi Martin,

First thing is to identify all the dimensions or lookup tables from the Application People_Archive.

Go through the link below and follow it to be able to generate a surrogate key and make use of it in the sales table. The first thing is to identify the dimensions or lookup tables

If there is any issue let me know. If it addresses your challenge then tick it as solution to help others.

Martin · August 28, 2020, 1:10pm

Hi @adesinamk,
Thank you for looking into this issue. My first apporach was pretty close to your solution, including creating an index column in the dimension table, which btw. breaks query folding so I would not recommend it for large datasets loaded from a SQL-Server.
The issue with my solution is that it makes loading the data much slower than a solution that folds to the SQL-Server.
Since I’m dealing with slowly chaging dimensions that have valid from and valid to date columns your solution is not applicable to my problem. Your solution suggests to merge columns to generate a column with unique values in both tables that can be used for a merge. My dimension records look e.g. like this:
… | Person ID | Valid From | Valid To
… | 12345 | 2018-10-12 | 2020-05-18
And my fact records look like this
… | Fact Date | Person ID
… | 2019-03-12 | 12345
99% of the facts do not occure at the same date at which, by chance, also a dimension value changes, so I cannot merge the surrogate keys to the fact table based on a concatenated column create from Fact Date and Person ID. So which columns should I use? That’s why I ended up not using the Power Query table merge to add the surrogate key to the fact table, but instead I use a lookup function. And that’s where loading starts getting slow. Is there a way to speed up loading? Either the Power Query engine is ineffictive in general or my solution is ineffective by it’s specific design, but the SQL server does the same job in a fraction of time. And not because it’s a faster machine. I run SQL-Server and Power Query on the same machine in my development environment.
BR
Martin

Martin · August 31, 2020, 10:42am

I have found a solution, at least for limited size of dataset, which is not so nice in design but nicer in performance:

I created a date column with all consecutive dates for which I have valid dimension values (not dimension value changes) up to today.
I unpivoted a table consisting of ValidFrom date, PersonID (my dimension value ID) and surrogate key from the dimension table so that I got a matrix: PersonID (columns) by Valid From date (rows) with the surrogate keys in the cells.
Then I merged the unpivoted dimension table with the generated dates column based on valid from date with a right outer join, meaning I’ll keep all date rows so that I’ll get conscutive dates on my matrix.
Then I sort this table by the merged, consecutive date column in descending order
Then I used fill down function in all PersonID columns so that for each date I’ve got the valid surrogate key.
Then I removed the date column that originated as ValidFrom column from the dimension table.
Then I unpivoted the table so that I have columns date, PersonID, and surrogate key, for each consecutive date.
Now I can join this table with the fact table and get for any fact date the correct surrogate key.

With this approach loading takes 20 seconds instead of 11 seconds when doing everything on the SQL-Server. This is still almost double the time but way faster then calling a Power Query custom function per fact row to lookup the surrogate key which took 168 seconds.
What worries me a bit is that with a lot of dimension vales this matrix will become huge, consuming a lot of memory and potentially becoming slow again. So, if there is no other fast approach in Power Query, obviously there is a good reason to do this on the SQL-Server in a real life scenario with millions of dimension values (think e.g. of customers of a phone company) and not in Power Query.
I’ve attached the solution as a reference: https://drive.google.com/file/d/1rfLpNWuICuMTnIh1MaCtUW1cflU676QQ/view?usp=sharing