Complicated data

powerbideveloper · November 22, 2019, 10:06pm

So I have a few Power BI fact tables that have all kinds of data in it. I don’t have access to the data source, just the .pbix file.

Some of the issues:

monthly repeating employee data (empid, dob, gender, etc)
data corrections resulting in one month’s value being different from another month (one month the grade is a 10, in another month that same record has the value as 7)
yearly age differences
judging from my calculation differences, there are more issues that still need to be revealed

The Power BI data model and some reports have already created (by someone else).

Is there any way I can detect these issues in advance so I can account for them in my DAX calculations?

I’ll assume the answer is no, but I would appreciate a few ideas on how to approach this scenario. I have taken into account the yearly repeating data, but the differences in some are of concern and I am not clear on how to deal with those.

Thank you!

BrianJ · November 23, 2019, 9:26pm

@powerbideveloper ,

Trying to account for “dirty” data in your DAX calculations is difficult at best. Much better to clean the data in Power Query first. If what’s preventing you from doing that is that you only have the PBIX file and not the source data, then this is what I would suggest:

Extract the data from the PBIX files to .CSV using DAX Studio (see video below for a great trick on how to export all of your tables at once)
Bring the CSV files back into the PBIX file via Get Data
Clean the data in Power Query
Reset the relationships in your data model from the “dirty” tables to the clean ones

I do this regularly in responding to forum posts where the original poster only provides the PBIX file, and it works great.

Brian

powerbideveloper · November 23, 2019, 11:11pm

Brian,

Yes, I am familiar with that great technique of extracting the data using Dax Studio, but thanks for confirming my instinctive thoughts of cleaning the data and basically the need to redo the model.

My guess is that no one knows where the dirty data is in the dataset, so they decided to just use the data as-is.

Are there any advanced techniques, either in Power BI or another tool, that I can use to help me identify data that could potentially be dirty? The dirty data I mentioned in my question, I found using SSMS and a few T-SQL queries but I’m not sure if it’s effective enough since they’re not so obvious, and I could not detect those in Power Query.

BrianJ · November 24, 2019, 12:22am

@powerbideveloper,

Great question. I’d love to hear the techniques that others on the forum use but here are a few ideas to start the discussion:

Turn on column distribution, column quality and:column profile in Power Query. Provides great information on the empty values, errors, potential outliers, etc.

Running VALUES or DISTINCT functions on columns can help identify inconsistent data entry problems
Lots of techniques for identifying outliers, some of which may be legitimate values while others may be data entry errors

https://forum.enterprisedna.co/t/outlier-detection/3412

If you use R, people have written a lot of scripts to detect different types of dirty data. With R installed, you can now run these scripts as steps in PQ:

Eager to hear what techniques and tools others are using.

-Brian

Melissa · November 24, 2019, 9:13am

For anyone using Column Profiling please know that this is done for the first 1000 rows.

You can change this into “the entire data set” by clicking on it in the bottom left corner of your screen (indicated by the arrow below) as you can imagine depending on the size of your data set this comes at a cost so when you’re done just switch it back to the top 1000 rows
Column%20profiling

or alternatively you can use the Power Query Table.Profile function, to create a full profile report on your data, here’s an article on that.