Referential integrity

I have a number of referential integrity issues.

There are five dimensions that join to multiple fact tables (a galaxy schema rather than a plain star schema).

I have been to Phil Seamark's page on referential integrity, which showed me how to identify the issues.

What is considered best practice to resolve them?

My first item is Sales and Product.
Sales has 6,000,000 rows.
Product has 265,000.

I am missing 1,244 rows in the Product table that are referenced in the Sales table.
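
For reference, the equivalent check in the query editor is roughly an anti-join like this (table and column names are illustrative, not my actual model):

    // Sketch: list the distinct Sales keys with no match in Product.
    let
        MissingKeys = Table.Distinct(
            Table.SelectColumns(
                Table.NestedJoin(
                    Sales, {"ProductKey"},
                    Product, {"ProductKey"},
                    "Match", JoinKind.LeftAnti
                ),
                {"ProductKey"}
            )
        )
    in
        MissingKeys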

Assuming that I can't fix this in the data source, is there a best practice for this?
I could merge on a left join from Sales to Product and, where there is no match, substitute a default key of -99?
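
In query editor terms I am picturing something like this (table and column names are illustrative; a matching -99 row would also need appending to Product so the relationship stays valid):

    // Sketch: left join Sales to Product, defaulting unmatched keys to -99.
    let
        Joined = Table.NestedJoin(
            Sales, {"ProductKey"},
            Product, {"ProductKey"},
            "Match", JoinKind.LeftOuter
        ),
        // Pull the dimension key out of the nested join column
        Expanded = Table.ExpandTableColumn(
            Joined, "Match", {"ProductKey"}, {"DimProductKey"}
        ),
        // No match on the one side => substitute the dummy key
        Fixed = Table.AddColumn(
            Expanded, "CleanProductKey",
            each if [DimProductKey] = null then -99 else [ProductKey],
            Int64.Type
        )
    in
        Table.RemoveColumns(Fixed, {"ProductKey", "DimProductKey"})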

As I am already having issues with refresh times, any added complexity needs to be efficient.

Many thanks again
E

@ells Referential integrity shouldn’t impact the performance of a refresh; it is more of a problem during DAX computation.

@AntrikshSharma
Yes, I agree, but surely if I add a whole load of lookups and corrections to my queries it will slow down the refresh. So if, in the query for the 6 million rows, I check whether each product key exists in the Product table and replace it with a dummy where it doesn't, will that have a big impact or a small impact?

I suspect a considerable impact, and possibly a very considerable one if done badly.

Thanks
E

Hi there,

When connecting to a data source using DirectQuery, you can use the Assume Referential Integrity selection to enable running more efficient queries against your data source. This feature has a few requirements of the underlying data, and it is only available when using DirectQuery.

Setting Assume Referential Integrity enables queries on the data source to use INNER JOIN statements rather than OUTER JOIN, which improves query efficiency.

Please read the following articles, which may be useful:

1 - https://docs.microsoft.com/en-us/power-bi/transform-model/desktop-relationships-understand

2 - https://docs.microsoft.com/en-us/power-bi/connect-data/desktop-assume-referential-integrity

Good luck…

@IlgarZarbaliyev

Many thanks; however, an inner join would remove records from the fact table, which is not acceptable to users.

I will read through the two articles, as that sounds like it could be useful in other projects.

Thanks
E

@ells As far as I know, only a few kinds of functions invoke complex behaviour here; one of them is ALLNOBLANKROW, and I will have to test other functions to see if the behaviour changes. But to handle these kinds of issues, the engine automatically adds a blank row to the one side of the relationship.

Can you share the metrics of your model using DAX Studio? If the problem is in the refresh, then either you have complex Power Query transformations or there are huge columns in your model causing the slowness.

I think I have managed to resolve the bulk of the missing keys. The main issue was a join where 2M+ rows had a blank reference for the dimension key. The rest appears to be timing issues, but I will check this later tonight when I can look at the refreshed data set.

There are still about four joins with missing references (around 100 missing keys); however, that volume is so low that it is acceptable.

Many Thanks
E

Hi @ells, did the responses provided by @AntrikshSharma and @IlgarZarbaliyev help you solve your query? If not, how far did you get, and what further help do you need? If yes, kindly mark the answer that solved your query as the solution. Thanks!

I think there may be no answer to the question. I still think that if I resolve the referential integrity it will add refresh time that I cannot afford.
Due to the size, I need query performance to be as good as possible.

Thanks
E

@ells do you have a data warehouse? If yes, you need to enforce referential integrity there. In your ETL process, make sure that:

  1. No fact tables are loaded into the DWH with corrupt keys
  2. No dimension records/tables that are still used in a fact table are deleted

If you need to solve it in PBI, you could do the following (see the sketch after this list):

  1. Look up incorrect keys in the query editor via a left join from fact to dimension
  2. Change incorrect keys to a dimension key that does not yet exist (e.g. key 0)
  3. Add that key to the dimension table with attributes marking it as a “missing attribute” member
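
As a rough sketch of step 3 in the query editor (table and column names are placeholders, and the attribute columns would need to match your dimension):

    // Sketch: append an "unknown member" row (key 0) to the dimension.
    let
        UnknownRow = Table.FromRecords(
            {[ ProductKey = 0,
               ProductName = "Missing attribute",
               Category    = "Missing attribute" ]}
        ),
        ProductWithUnknown = Table.Combine({Product, UnknownRow})
    in
        ProductWithUnknown

Step 2 is then a conditional replace on the fact table's key column wherever the left join finds no match.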

Kind regards

@Rens

I have no control over the data quality in the data warehouse.

I have looked at changing the foreign keys in the fact tables that don't exist in the dimension table. I started merging the fact and dimension tables to identify the missing keys, but unfortunately the refresh times then went to over 3 hours, which is unacceptable.

Either there is a better way than a left-join merge that changes the key where there is no match, or I will leave it as is.
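
One idea I have not fully tested: leave the 6-million-row fact table untouched and instead append the orphaned keys to the dimension as placeholder members, so only the smaller table gets transformed (names illustrative):

    // Sketch: add the orphaned fact keys to the dimension as placeholders,
    // instead of rewriting every fact row.
    let
        Orphans = Table.Distinct(
            Table.SelectColumns(
                Table.NestedJoin(
                    Sales, {"ProductKey"},
                    Product, {"ProductKey"},
                    "Match", JoinKind.LeftAnti
                ),
                {"ProductKey"}
            )
        ),
        // Flesh the orphan keys out with a placeholder attribute;
        // Table.Combine fills any remaining dimension columns with null
        Placeholders = Table.AddColumn(
            Orphans, "ProductName", each "Unknown product", type text
        ),
        Patched = Table.Combine({Product, Placeholders})
    in
        Patched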

As it is, I think it reflects the poor data quality of the underlying systems.

Thanks
E

I fully agree, @ells. It reflects poor data quality. Is there a DWH in between? If yes, I hope the customer will forward this to the responsible colleague. If not, and you are connecting directly from the source, it might be interesting (but a big project) to look into a DWH option.

3 hours seems very long for a refresh. For a daily report it might be acceptable if you run it nightly, but if you need to update throughout the day it is a red flag.

I assume you are on Power BI Pro? I do not have experience with Power BI Premium, but you might investigate that option.

Furthermore, there is another option, but it will take some work. You might consider using Azure Analysis Services as a layer between your Power BI report and your database, which might help speed up the process. Azure Analysis Services then becomes the place where you build your model; in AAS you can add more compute power so that the refresh goes faster. Afterwards, you query that model from Power BI…

Hi @ells, did the response provided by @rens help you solve your query? If not, how far did you get, and what further help do you need? If yes, kindly mark the answer that solved your query as the solution. Thanks!

@Rens
This is on Premium.
With regards to the responsible people, I am not sure what to say. In my experience, since people decided to move to an agile method of development, the quality has dropped. I'm not saying Agile is bad, just saying that sometimes bad stuff gets through the cracks. My aim is to keep the cracks as small and manageable as possible.
This is not the worst place I have seen for data quality.
Thanks
E

Hi @ells

There are mainly three places where you can make changes with respect to Power BI, i.e. the source, Power Query, and Power Pivot (DAX).

PS: For me, the best option is to get it fixed at the source, especially if you have an ETL setup available.

There will be some disadvantages with all of the above, and you can choose what suits you best. As all the options are already covered, I suggest you close this topic.

Thanks
Ankit J

@ankit
Thanks. This has been a bit of a nightmare: no ETL, and no chance to improve the quality of the DWH data.

In the query editor I added a merge on each of the three fact tables to the dimensions that had issues. The really “good” part of this was that it totally made a mess of one fact table. It took me a long time to realise that the data warehouse was adding blank spaces at the end of one of my keys.
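
For anyone hitting the same thing, trimming the key column before the merge sorted it; something along these lines (column name illustrative, and assuming the key is text):

    // Sketch: strip the trailing spaces the DWH appends to the key
    // so the merge keys actually match.
    let
        Trimmed = Table.TransformColumns(
            Sales, {{"ProductKey", Text.TrimEnd, type text}}
        )
    in
        Trimmed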

I still have issues between one fact table and one dimension table, but as the users are still arguing over what they want, I won't fix it until they decide.

Many thanks for all the help. It took a long time to make and apply the changes, but it is now in a happy place.
Many Thanks
E

Hi @ells, did the response provided by @ankit help you solve your query? If not, how far did you get, and what further help do you need? If yes, kindly mark the answer that solved your query as the solution. Thanks!

@AntrikshSharma
I have marked your response as the solution. While it took almost a day to go through the model and correct the RI issues, there has been no downside in refresh time or performance.

The only downsides are that when I make a change to a query, the time to apply is longer; and if I change the query for one dimension and apply the change, it refreshes the three fact tables it is linked to.

Many thanks for the help on the journey to resolve this.
E