Comparing two data set in power Bi

alora101 · April 29, 2019, 5:42am

Hi,

I was wondering if this is possible with power BI

We have versioning of files in data lake. Version 1 , 2 up to 5 depend on the changes made in each fine.

A folder (version ) contains five files , the user can change any of the files and ingest the data through data factory in azure , the new files will now be a new version and there is a transformation that merge all files together to produce two output however, in the schema in data lake , a new column is created that holds the version.

Now, a user want the capability to query any version of file so they can see what has changed just like slowly changing dimensions in data warehouse

I am thinking in this direction, to duplicate the main dataset so I can use a parameter or filter on one version and the duplicate data set using another version but I don’t know if it is the right thing to do

However, they want the ability to see the differences between two files . I don’t know if it makes sense or is it do-able in power Bi or how would you tackle this.

Lastly, they want to be able to compare these files based on dates, for instant, file version 2 that was produced on the 1st of May against file version 2 created on 2nd of May… I assumed that i need to date calendar for this

sam.mckay · May 1, 2019, 4:01am

I haven’t seen a scenario like you’re describing, but my intuition here is saying that you won’t be able to use Power BI to see the difference between the files unless you bring them actually in Power BI and complete the comparison there.

If you bring all the information into Power BI, that will give you the best chance to get this done, where you can use the query editor and right model to then visualize the difference.

I don’t believe there is any way Power BI can look back into other systems and see anything really.

All querying takes place mainly within the query editor, not from the reports - which is where your consumers will be using the reports.

It all sounds a bit to complicated here. Can you think of ways to simplify this process overall.

My recommendation is to get all the data you can (that makes sense) into Power BI, then you have all the power in world to find the insights you need.

sam.mckay · May 1, 2019, 4:02am

Another quick thought is that you want to somehow structure the data in your database so all the information sits in one main table, then you can use DAX measures in Power BI to complete all the heavy lifting with the calculating and analysis etc.

Nick_M · May 1, 2019, 3:02pm

@alora101,
I think this can be done using Power Query and some M…but would need to see a few sample files if that is possible.

Basically I am thinking to bring in all the files from a folder, but not combining them all right away. Then adding in a column for the version number. This would be the quasi-fact table (as I’m not sure if you are actually looking for all the information from the CSV file to be in the table. I would think you would, but dont like to assume).

Nick

alora101 · May 1, 2019, 3:39pm

Hi Nick,

Thanks very much for your reply,

Now, we have decided to move the output files to Databricks (database).
Each file will have a binary checksum to know if it has change or not.
Each file is loaded into a table in databricks.

Now, i can use power bi to connect to data bricks using spark connector and merge all files/table as one then implement my logic.

The above dataset i sent earlier on, it from the log file that i split to understand the requirement. Moving forward data will be coming from databricks…