Data Model Optimisation

Hello

One of the sources of my Power BI file is a mammoth view. This view aggregates data from three huge tables in a SQL database, each with more than 10 million rows. The view filters the output down to transactions from 1st July 2018 onwards. Even so, there are well in excess of 7 million rows.

Due to the size of the Power BI file (650 MB, badly in need of optimisation), the automatic data refresh sometimes fails, and its visuals do not always display the expected content.
In short, the file does not function properly.

My question for you is: do you know of a way to import data from such a view efficiently? I understand that incremental refresh does not work on views, only on tables without any filtering.

Regards

Doru Imbroane

2 Likes

Hi @dimbroane

Incremental refresh is supported on data sources that support query folding (Incremental refresh in Power BI - Power BI | Microsoft Docs). In my experience, this includes SQL Server views.

However, it sounds like you have bigger issues at play here. I run a sample database locally on my laptop with ~12 million rows of sales data, and I have no issue running a full refresh into a PBIX file.

I would suggest reviewing the SQL views to see if they can be broken down into a star schema - that is, several views which, when imported together, form the basis of your model, effectively pushing the semantic model to source. This reduces the number of columns being imported per table, significantly decreasing the amount of data transmitted and speeding up the entire process.
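
If you can't change the views straight away, you can prototype the same split on the Power BI side by referencing the big view in more than one query and keeping only the columns each query needs. A minimal sketch of the dimension side, with purely illustrative server, database and column names:

```
let
    // Illustrative connection details - substitute your own
    Source    = Sql.Database("MyServer", "MyDatabase"),
    ComboView = Source{[Schema = "dbo", Item = "Transaction Combo"]}[Data],

    // Keep only the descriptive customer columns...
    CustomerCols = Table.SelectColumns(ComboView, {"CustomerID", "CustomerName", "MailingAddress"}),

    // ...and collapse to one row per customer. Both steps fold back to SQL Server
    // as a SELECT DISTINCT over just these columns, so the wide view is never pulled across whole.
    CustomerDim = Table.Distinct(CustomerCols)
in
    CustomerDim
```

A second query over the same source would do the mirror image for the fact side: keep only the keys, dates and numeric columns.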

Alternatively, you may want to review the indexing against the underlying tables the view is calling on. It may be that adding a well-placed index or two could speed up the data extraction process. This should be raised with the database administrator.

There are other options as well. If you’re able to send me a copy of the ‘Export Metrics’ from DAX Studio I can review the model without needing the data to see if I can spot anything that may help.

Regards
Stephen

3 Likes

I think you should try to remove unnecessary fields and then reduce the precision of fields with many decimals (to 2 decimal places or whole numbers wherever possible).
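
For example (the column names here are just placeholders), you can round and re-type in the same Power Query query:

```
let
    // Placeholder connection and column names
    Source    = Sql.Database("MyServer", "MyDatabase"),
    ComboView = Source{[Schema = "dbo", Item = "Transaction Combo"]}[Data],

    // Round noisy decimals to 2 places and store them as Fixed decimal number
    Rounded = Table.TransformColumns(ComboView, {{"Amount", each Number.Round(_, 2), Currency.Type}}),

    // Quantities that are always whole can become whole numbers
    Whole   = Table.TransformColumnTypes(Rounded, {{"Quantity", Int64.Type}})
in
    Whole
```

Fixed decimal number keeps at most four decimal places, so it usually compresses much better than a full floating-point Double.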

Do you use a Dates table in this file?

My thinking is that, if not, the hidden Auto Date/Time time-intelligence tables can bump up the file size considerably.
You'll need to check whether any measures use the Auto Date/Time tables and, if so, amend the file accordingly. As always, keep a backup at hand before making any amendments.
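
If you do end up needing a dedicated Dates table, a minimal one can be built directly in Power Query; the start date below simply mirrors the 1st July 2018 cutoff mentioned earlier, so adjust as needed:

```
let
    StartDate = #date(2018, 7, 1),
    EndDate   = Date.From(DateTime.LocalNow()),
    DayCount  = Duration.Days(EndDate - StartDate) + 1,

    // One row per day from the start of the data to today
    DateList  = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
    AsTable   = Table.FromList(DateList, Splitter.SplitByNothing(), {"Date"}),
    Typed     = Table.TransformColumnTypes(AsTable, {{"Date", type date}}),

    // A couple of typical attribute columns; add whatever your reports need
    AddYear   = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
    AddMonth  = Table.AddColumn(AddYear, "Month Name", each Date.MonthName([Date]), type text)
in
    AddMonth
```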

Hello Stephen and many thanks for reminding me of the DAX Studio!

As requested, I have attached the VertiPaq metrics.

28891d58-a621-4dca-b7e1-5e7f54a2c6b2.vpax (175.9 KB)

The view I am talking about is called “Transaction Combo”; it has the worst cardinality.
Second is the XSJCustomer table, which can be simplified.

As @amira.bedhiafi.pro suggested, one of the greatest improvements (before even attempting the Transaction Combo view) is removing all unnecessary fields.

1 Like

Hello, no - no Auto Date/Time tables are used.

Many thanks - great advice; this is where I should start the optimisation process.

@dimbroane I would try to start working with these columns first:

Organisation ID, Customer ID, Phone, billingDPID, shipDPID - see if these can be stored as integers.

If there is a pattern in a column, try splitting it: store dates and times in separate columns, for example, and see how that impacts the refresh.
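
For the date/time split, something along these lines in Power Query ([TranDateTime] is just a placeholder for whatever the real column is called):

```
let
    // Placeholder connection details
    Source    = Sql.Database("MyServer", "MyDatabase"),
    ComboView = Source{[Schema = "dbo", Item = "Transaction Combo"]}[Data],

    // Keep the date and the time in two narrow columns instead of one wide datetime
    AddDate = Table.AddColumn(ComboView, "TranDate", each DateTime.Date([TranDateTime]), type date),
    AddTime = Table.AddColumn(AddDate, "TranTime", each DateTime.Time([TranDateTime]), type time),
    Split   = Table.RemoveColumns(AddTime, {"TranDateTime"})
in
    Split
```

If you never report at the time-of-day level, you can drop the time column entirely and keep just the date.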

Also, just so you know: the PBIX may be 650 MB on disk, but the actual memory used by the model is 1.61 GB.

2 Likes

Thanks @dimbroane

Much like @amira.bedhiafi.pro & @AntrikshSharma have said, let's remove some columns you don't need. NB: if removing these columns is pushed to source via query folding, then less data is imported, which increases refresh speed.

The Top 10 columns by size account for over 50% of the model size:

| Row Labels | Rows | Cardinality | Columns Total Size | Database Size % |
| --- | --- | --- | --- | --- |
| ‘Transaction Combo’[RecordID] | 7,832,509 | 6,871,517 | 237,403,864 | 13.72% |
| ‘XSJCustomer’[MailingAddress1] | 2,829,565 | 2,148,871 | 104,557,749 | 6.04% |
| ‘XSJCustomer’[formalMailingName] | 2,829,565 | 2,082,578 | 100,662,480 | 5.82% |
| ‘XSJCustomer’[OrganisationId] | 2,829,565 | 2,292,975 | 98,879,763 | 5.71% |
| ‘XSJCustomer’[custid] | 2,829,565 | 2,829,565 | 96,335,601 | 5.57% |
| ‘XSJCustomer’[user01] | 2,829,565 | 1,924,384 | 87,229,011 | 5.04% |
| ‘Transaction Combo’[Refnbr] | 7,832,509 | 1,674,565 | 82,843,955 | 4.79% |
| ‘XSJCustomer’[MailingAddress2] | 2,829,565 | 1,209,322 | 61,418,303 | 3.55% |
| ‘XSJCustomer’[Shipaddr1] | 2,829,565 | 1,212,359 | 60,774,055 | 3.51% |
| ‘XSJCustomer’[billingaddr1] | 2,829,565 | 1,203,015 | 60,480,392 | 3.50% |

Many of these are also in the Top 10 columns for strings only:

| Row Labels | Rows | Cardinality | Columns Total Size | Database Size % |
| --- | --- | --- | --- | --- |
| ‘XSJCustomer’[MailingAddress1] | 2,829,565 | 2,148,871 | 104,557,749 | 6.04% |
| ‘XSJCustomer’[formalMailingName] | 2,829,565 | 2,082,578 | 100,662,480 | 5.82% |
| ‘XSJCustomer’[OrganisationId] | 2,829,565 | 2,292,975 | 98,879,763 | 5.71% |
| ‘XSJCustomer’[custid] | 2,829,565 | 2,829,565 | 96,335,601 | 5.57% |
| ‘XSJCustomer’[user01] | 2,829,565 | 1,924,384 | 87,229,011 | 5.04% |
| ‘Transaction Combo’[Refnbr] | 7,832,509 | 1,674,565 | 82,843,955 | 4.79% |
| ‘XSJCustomer’[MailingAddress2] | 2,829,565 | 1,209,322 | 61,418,303 | 3.55% |
| ‘XSJCustomer’[Shipaddr1] | 2,829,565 | 1,212,359 | 60,774,055 | 3.51% |
| ‘XSJCustomer’[billingaddr1] | 2,829,565 | 1,203,015 | 60,480,392 | 3.50% |
| ‘XSJCustomer’[billingDPID] | 2,829,565 | 1,102,174 | 45,836,389 | 2.65% |

With the above in mind, and looking at the mix of data types, there are some opportunities here to convert text strings into numbers. Ideally this is done in Power Query, with the conversion pushed to the data source via folding (see the sketch after the table below).

| Row Labels | Data Size | Database Size % |
| --- | --- | --- |
| String | 229,412,680 | 72.59% |
| DateTime | 43,247,136 | 10.77% |
| Int64 | 22,805,944 | 13.93% |
| Double | 18,075,400 | 2.94% |
| Grand Total | 313,541,160 | 100.00% |
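
Purely as a hypothetical sketch against the columns above (and only after checking nothing in the report uses them), the Power Query steps would look something like this; you can confirm they fold by right-clicking the step and choosing View Native Query:

```
let
    // Illustrative connection details - substitute your own
    Source   = Sql.Database("MyServer", "MyDatabase"),
    Customer = Source{[Schema = "dbo", Item = "XSJCustomer"]}[Data],

    // Drop wide text columns that are not used in any visual or measure
    Trimmed = Table.RemoveColumns(
        Customer,
        {"MailingAddress1", "MailingAddress2", "Shipaddr1", "billingaddr1", "user01"}
    ),

    // Re-type ID-style text columns as whole numbers (only if their values are purely numeric)
    Typed = Table.TransformColumnTypes(
        Trimmed,
        {{"OrganisationId", Int64.Type}, {"billingDPID", Int64.Type}}
    )
in
    Typed
```
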
3 Likes

Hello dimbroane,
A few more helpful tips.

  • Import only the necessary fields. No need to import all fields.

  • Use measures as much as you can rather than calculated columns; calculated columns take up memory.

  • If you have a date/time field but only need either the date or the time, split the column in two and keep the part you need. Date/time fields take up a lot of memory.

  • Make dimensions using repetitive fields.

1 Like

Hi @dimbroane, did the responses provided by the users and experts help you solve your query? If not, how far did you get and what further help do you need? If yes, kindly mark the thread as solved. Thanks!

Hello Sam

The help I got is invaluable; all I have to do now is remove the extra columns, build measures, and eliminate the bi-directional filters.
I am doing this whenever I can spare a moment, alongside my thousands of other tasks at work.

However, my main question has not been answered yet: how should I deal with that mammoth view? Should I keep importing it, is it possible to use DirectQuery against it, or can I refresh it incrementally?

Even so, I am confident that once the model is simplified, the view itself will not be a problem.
Therefore, you can close the thread if necessary.

Kind regards

Doru Imbroane

@dimbroane, one of the drawbacks of using SQL views is that they have to compile every record from the underlying data source before applying any externally provided filters (i.e. the view can't say whether to include or exclude a row until it has checked all the underlying data). That being said, properly designed views hide enormous complexity in the underlying data architecture and are generally intended to surface data for consumption in tasks like yours.

Specifically addressing incremental refresh versus DirectQuery: do not use DirectQuery against a mammoth view, for the very reason given in my first paragraph. But do attempt an incremental refresh if possible. The exact mechanism for incremental refresh depends heavily on the data fields available and the needs of your report. The following applies whether you are sourcing from a view or from tables.

Ideally, every row being imported will have some kind of date you can use to determine the “age” of the data. That could be an order date, record created date, record last modified date, etc. Then you only import rows of a certain age (e.g. the last 10 years, the last 30 days, and so on). The more complex version is “give me everything that has changed since X date”, with “X date” typically being the last time you ran the import. It gets uglier if, for example, orders can be modified after the order date but there is no modified date indicating the record was changed.

Even if you can't use the incremental refresh capability in the Power BI service (I can't, because I am on-prem), you can still import only the last X days/years of records into a staging table in Power BI and then merge or append as needed to the primary fact table in your data model (and to the dimensions, if they are derived from the big import).
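
For reference, the incremental refresh feature in the service boils down to filtering that date column between two DateTime parameters which must be named RangeStart and RangeEnd; the very same filter step also works for a manual “last X period only” import if you set the parameter values yourself. The connection details and the [TranDate] column below are assumptions, so substitute your own:

```
let
    // Assumed connection and column names
    Source    = Sql.Database("MyServer", "MyDatabase"),
    ComboView = Source{[Schema = "dbo", Item = "Transaction Combo"]}[Data],

    // RangeStart / RangeEnd are DateTime parameters defined in Power Query.
    // The filter folds to a WHERE clause, so only the requested slice of the view
    // is ever pulled across the wire.
    Filtered = Table.SelectRows(
        ComboView,
        each [TranDate] >= RangeStart and [TranDate] < RangeEnd
    )
in
    Filtered
```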

Other options include creating or asking for a SQL view optimized to your needs. I get to “cheat” because I develop all my own SQL Server architecture :slight_smile: , but even in a big company you could ask for your own view.

All the other advice about reducing the columns is of course critical. You should only import the exact columns you need, especially with massive datasets where even with modern tech you can hit bandwidth and server performance bottlenecks.

Hope this helps with your main question.

John C. Pratt

2 Likes

Many thanks, John; this is the complete answer I was hoping for.

1 Like

Really enjoyed reading this thread and your answer @jpratt. Could you use dataflows to extract from the SQL view and then filter on the “age” date so you have two tables, one for historic data and one for new data, and then only refresh and import the new data into the desktop?

@Sue, based on what I have read about dataflows (I only have PBIRS on-prem at this time), not only are dataflows a great idea for this exact purpose, but they can actually replace the “massive SQL view” altogether. You could then create shared datasets sitting on top of the dataflows that provide the data pre-filtered, so that all you have to do in the Power BI report is import the shared dataset. And of course you can base the dataflows on the SQL view itself and build from there if there is no way to gain access to the tables underneath the view. The architect in me is itching to get involved with this!

3 Likes

I use them for one of my customers whose EPS system won't allow us to connect to the source directly, so they provide weekly data dumps into a SharePoint folder. I used to have all sorts of issues with failed refreshes going directly from SharePoint to the PBIX, but I moved to dataflows over a year ago and they have been stable since. I also use them to pull different tables into different reports with the transformation logic defined just once. There are a few limitations as I only have a Pro licence, but generally I really like them.

2 Likes