Dataflows vs Shared Dataset

Hi!
I am considering data flows vs. shared data sets in a project. I wonder if I am missing some obvious advantages with using data flow would like your opinion.

I have an existing DW with fact and dimension tables already in place. Some minor tranformations are still needed in Power Query (PQ) to get the data in the right shape. I will problably need to create several models for analysis and some dimensions will be shared between several models.
My initial idea was to create a number of shared datasets, maintained in separate PBIX files. These would only contain the data and the data model, no reports.
Reports would be created connecting to the shared datasets, and published to other workspaces. This would give a good separation between ETL and modelling and the visualizations.
Now consider using data flows instead/as well. I really like the separation between data/ETL and visualization but I find some things confusing.

A data flow only contains the data and the ETL in form of Power Query online (PQOL). I will still need a dataset to model that data and to create measures. If I want to separate the model from the visualization I will still need to have the data model in a separate PBIX. Only difference is that instead of reading directly from the data source (DW) and doing the transformation in PBI Desktop, I am connecting to the data flow and the transformations are managed in the data flows instead of in Power BI Desktop. I do not want to have the data model in the same PBIX as the reports as this make maintenance a pain.

The benefit of using data flows as I see it are:

  1. No need to have direct connection to the data source from Power BI Desktop as this is done through data flows instead.
  2. Refresh can be individual per data flow and not the entire data set.
  3. Using data flows allow me to connect to additional data sources in PBI Desktop whereas if I connect to a PBI dataset I am limited to whatever is in that dataset.
  4. Common data sources like calendar tables can de defined ONCE, no need to copy M-queries between PBIX files.

…but
I will still have the problem with separating visuals from dataset
All models that I create that need the same measures requires these to be duplicated

What am I missing with data flows?
Thanks Fredrik

Hi @Frede,

Have you already checked out this video?

.
This is also definitely worth checking out. Christian Wade, had some cool announcements at MBAS 2020 last week, look up this video, to get a scoop on what’s coming…
OND3036
What’s new and coming in Power BI for enterprise-scale deployments

@Melissa Yes I have, but I still don’t see the natural fit for data flows. IMO it seems that the truly right way to do it SSAS where you build the model, define relationships AND create your measures.

Here is how my company uses dataflows:

  • I create data flows for the ERP system I am most familiar with
  • while a colleague creates data flows from the other ERP, which he knows better

we both need to use both ERPs in our final solutions, which are shared datasets, but we don’t share our datasets with one another (the model would have to be too bloated to accommodate each of us). So our shared data set is really only shared within our respective reporting solutions.

The benefit is that an expert in the ERP has developed the data flow you are using, and the burden of the transformations takes place in the service (at least the way we are doing it), so refreshes in the desktop model are much faster.

Even if you are not dealing with multiple ERPs (we actually have four I could be reporting from), pushing the majority of the transformation work to the service is a BIG plus for me. Also, as you pointed out - refreshes can be done as you choose, in this way I was able to compartmentalize my data and have older tables that only get refreshed on a monthly basis, while my more current tables get a daily refresh.

@Heather Thanks for your input Heather. How do you manage measures between these different entities? Let’s say the HR dept creates a data flow containing an Employee entity. You need a measure “Number of employees”. Now for every solution that use the Employee entity you will need to define that measure in every data set with the risk that the exact definition of the measure will differ between the different data sets. I suspect this is an issue with an AS tabular model as well.

You can’t define measures in the dataflows at all, what happens in the dataflow is only equal to what is going on in the query editor. That’s why the data set is still needed.

And my company is small enough that we have two developers for BI - myself and my colleague. And the models that he and I are using are very different - he needs reports around accounting data, while my reports focus on sales and customer data.

There is a small overlap in what we need, but we are both able to handle those few measures in our distinct data set. Currently, we each have only one or two datasets, and all of our reports are connected to one of the datasets we have created.

Example - I have two datasets, one for internal use, the other for reports that are ‘customer facing’
these two data sets are connected to multiple reports. But both datasets are ‘fed’ from the same set of data flows.

1 Like

The issue i’m facing currently is expanding this to an enterprise level. I want self-service BI environment, where end “power-users” can create models and measures based off of a pre-defined dataflow. Currently this is not available and I’m not sure how to go on.

But currently I use Dataflows

I"m sure others can comment on the enterprise level development with data flows, and that may need to be a separate forum discussion, but to be honest when I first started investigating them for use in our environment, enterprise level benefits were all I was seeing.

In many cases I was reading, the data flows were developed by someone who may never even touch the data set, and shared data sets were created by people who were’t the final report developers. This compartmentalized the work to allow people to create solutions based on their job and technical strengths.

In my case, I have a job where I wear a LOT of hats, including pointing IT to the resources they needed to learn about in order to build the gateway that is used to pass information into the data flow.

Hello @Frede,

You should watch Reza Rad’s session today at London PUG explaining in detail what to consider an When one is suited or the other one.
Main bullets are here (screen-shot from the webinar)

This is the link with the video: https://www.youtube.com/watch?v=PXGZ0YhJSQM

Having all these info now you should be in a good position to get the best option.

If you liked my solution please give it a thumbs up :+1:.

If I did answer your question please mark my post as a solution :white_check_mark:.

Thank you !

Cristian

Hi Cristian!

I have watched that (and all other videos on data flow that I could find), but I still don’t think it’s obvious when to use data flows and when not to. As @Tylesftw says there is no modelling involved in the data flows, an entity is basically just comparable to a table (for a consumer of the entity) so I don’t really see how usable they are for end users as they must understand the relationship between the entities to be able to build an analysis model. To me a shared dataset that contain A) data B) the relationship between data and C) any measures needed is more fit for self-service. A user that want to design their own reports can just connect to the dataset and do not need to worry about the data model as it is in the dataset. It sounds to me that data flow is really (only) aimed toward large enterprise where you need multiple analysis models that share parts of the same data and having data flows will enable you to define the entities (tables) once instead of having to define them for every analysis model.

Hi @Frede,

Having all these info … in the end it’s the developer choice what to use, right? :slightly_smiling_face:

Cristian

I agree, and I wonder how common this is and if it is even desirable. Generally working in silos is a bad thing. If an end user need additional info in his report, he would go to the data modeller and ask her to add a new column, the data modeler would go to the data flow creator and ask for that additional data to be loaded to the entity from the data source etc. Isn’t this part of what we want to avoid? :slight_smile:

Well, obviously in my situation it is not at all desirable, if it would even be possible, but it’s not since I don’t have enough people in my organization that understand Excel that well - let alone Power BI.

However, consider the other argument - for one of my ERP systems, an account can have both ship-tos and departments. Neither of these elements requires the other, and if I were to design a dataset with a straightforward relationship to either the department or the ship-to, then I could not filter to the other without some DAX gymnastics.

Now, I know how to do that, so I have one measure that reports sales - and it is designed to report properly if I am filtering by Account, Ship-To or Department (with many thanks to Microsoft for the ISINSCOPE function).

I can hand this dataset off to someone who doesn’t understand this relationship, and tell them to just but the sales measure into a table. Because I know how to design the dataset to be robust, their model will just work, no matter how they choose to slice it.

As Cristian mentioned above, it’s really up to the developer, and your particular needs.
For me, the biggest reason to use dataflows is processing power, I’m currently doing a lot of work on my dataset from home, and I would be spending all day just on refreshes if I had to handle everything on my desktop computer.

Hi @Frede, a response on this post has been tagged as “Solution”. If you have a follow question or concern related to this topic, please remove the Solution tag first by clicking the three dots beside Reply and then untick the check box. Thanks!

@Heather That is indeed a big benefit of dataflows. I often find myself in situations where the data source is not easily accessible from my personal computer.