I have billions of rows of data that Power BI (PBI) cannot handle for modelling. Import is very slow.
What are the strategies to handle this situation?
These billions of rows are credit card transactions.
The current approach uses Tabular Editor 3 (TE3) for data modelling with a Live connection; PBI is only used for data visualization.
However, this method is not ideal for PBI modelling because the data model is saved as an SSAS cube, which makes things less flexible and inefficient since multiple platforms are involved (SQL Server, TE3 and PBI). SQL Server is also running out of memory due to the large data model (around 20 GB).
Aggregation and partitioning strategies have already been applied.
Three years of data need to be retained.
Cloud is not an option as everything must be on-prem.
So, is there any strategy to handle this volume of data in PBI?
Hi @kathryne.mcglynn - Based on the post, it seems most of the standard strategies have already been tried. One thing that could still help is reducing the data where possible by splitting it into multiple models based on dates or another dimension.
For example, keep the latest data in one model and historical data in another, or split on some other dimension.
Dealing with billions of credit card transactions in Power BI while keeping everything on-prem definitely presents challenges, but there are strategies to optimize performance and make the model more manageable.
Since you’ve already applied aggregation and partitioning, here are some additional approaches:
1. Hybrid Mode (Composite Models)
Instead of using a full Live connection, you could explore DirectQuery + Import Mode (Composite Models). This allows Power BI to store summarized data locally while keeping detailed data in SQL Server. You can pre-aggregate frequently used data while still having access to the full dataset via DirectQuery.
2. Incremental Refresh for Large Data Models
Power BI Incremental Refresh helps to load only new or modified data instead of importing everything repeatedly. Since you’re retaining 3 years of transactions, setting up incremental refresh on partitions (e.g., monthly or quarterly) can significantly reduce memory consumption and improve performance.
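Incremental refresh itself is configured in Power BI (RangeStart/RangeEnd parameters plus a refresh policy), but the underlying idea is just a watermark: only fetch rows newer than the last refresh. Here is a minimal Python sketch of that idea, assuming a hypothetical dbo.CreditCardTransactions table with a TransactionDate column:

```python
# Minimal sketch of the watermark idea behind incremental refresh.
# Table and column names (dbo.CreditCardTransactions, TransactionDate)
# are hypothetical placeholders -- substitute your own schema.
import datetime
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-onprem-sql;DATABASE=Cards;Trusted_Connection=yes;"
)

def load_increment(last_refresh: datetime.datetime):
    """Fetch only the rows added since the previous refresh."""
    sql = """
        SELECT TransactionID, TransactionDate, CardID, Amount
        FROM dbo.CreditCardTransactions
        WHERE TransactionDate > ?
    """
    cursor = conn.cursor()
    cursor.execute(sql, last_refresh)
    return cursor.fetchall()

# Example: pull only yesterday's transactions instead of re-importing 3 years.
new_rows = load_increment(datetime.datetime.now() - datetime.timedelta(days=1))
print(f"{len(new_rows)} new transactions since the last refresh")
```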
3. Optimizing SQL Server Performance
Since SQL Server memory is running out, consider:
Columnstore Indexing: Compresses large transactional data, reducing storage and memory pressure and speeding up scans.
Pre-aggregations in SQL Server: Offload some of the aggregation work from Power BI to SQL Server using indexed views (SQL Server's materialized views) or pre-computed summary tables (see the sketch after this list).
Horizontal Partitioning: If your existing partitions are too large, break them into smaller segments (e.g., separate partitions for high-frequency and low-frequency transactions).
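As a concrete illustration of the columnstore and pre-aggregation points above, here is a hedged sketch that runs the relevant T-SQL from Python via pyodbc; the table, index, and column names are assumptions, so adapt them to your own schema:

```python
# Hypothetical example: add a clustered columnstore index and build a
# pre-aggregated daily summary table that Power BI can import instead of
# scanning billions of detail rows. All object names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-onprem-sql;DATABASE=Cards;Trusted_Connection=yes;",
    autocommit=True,
)
cursor = conn.cursor()

# 1. Columnstore compression on the detail table (one-off DDL).
cursor.execute("""
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_CreditCardTransactions
    ON dbo.CreditCardTransactions;
""")

# 2. Pre-aggregate in SQL Server so Power BI only imports the small summary.
cursor.execute("""
    SELECT CAST(TransactionDate AS date) AS TransactionDay,
           MerchantCategory,
           COUNT(*)    AS TransactionCount,
           SUM(Amount) AS TotalAmount
    INTO dbo.DailyTransactionSummary
    FROM dbo.CreditCardTransactions
    GROUP BY CAST(TransactionDate AS date), MerchantCategory;
""")
```

A summary table like this can then back the import/aggregation layer of a composite model, while DirectQuery still covers drill-through to the detail rows.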
4. TE3 Performance Tuning
Evaluate whether measure calculations can be optimized.
Ensure you are only loading necessary columns instead of the entire dataset.
Use hierarchical aggregations so Power BI doesn’t have to scan detailed transactions on every query.
5. Reduce Cardinality and Optimize DAX Calculations
High-cardinality columns and inefficient DAX lead to slow queries. Try:
Removing unnecessary high-cardinality columns, especially keys used only for relationships.
Optimizing DAX measures by avoiding repeated calculations across multiple columns.
Using pre-calculated tables instead of dynamically calculating everything in Power BI.
6. Optimized SSAS Model
If you’re using SSAS Multidimensional cubes, consider Hybrid OLAP (HOLAP) storage instead of MOLAP, so detail-level data stays in SQL Server rather than being duplicated in the cube.
Pre-aggregate key metrics in SSAS before sending them to Power BI.
If you can share additional details about your existing data model (tables, columns, data types) and aggregation levels (daily, monthly, etc.), I can suggest more targeted optimization techniques.
Alternatively, you can look into R, Python, or other programming tools that can handle billions of rows on-prem and visualize the data. Here are some powerful options:
Python-Based Solutions
Python has several libraries that can efficiently handle large datasets:
Dask – Designed for big data processing. Supports parallel, out-of-core computing and handles datasets larger than memory far better than Pandas (see the sketch after this list).
Vaex – Optimized for fast visualization of large datasets (billions of rows) using lazy evaluation.
Modin – Accelerates Pandas operations by parallelizing them across cores (using Ray or Dask under the hood), making it useful for large-scale data analysis.
PySpark – Great for handling big data using distributed computing via Apache Spark.
Plotly/Dash – Enables web-based interactive data visualization for large-scale datasets.
Matplotlib & Seaborn – If you can pre-aggregate your data, these traditional visualization libraries can work efficiently.
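For example, here is a minimal Dask sketch for an out-of-core aggregation over a large transaction dataset; the Parquet path and column names are assumptions:

```python
# Minimal Dask sketch: out-of-core aggregation of a large transaction dataset.
# The Parquet path and column names are hypothetical placeholders.
import dask.dataframe as dd

# Lazily reference the partitioned dataset; nothing is loaded yet.
transactions = dd.read_parquet(
    "/data/transactions/*.parquet",
    columns=["transaction_date", "merchant", "amount"],
)

# Daily spend per merchant, computed in parallel across partitions.
daily_spend = (
    transactions
    .groupby(["transaction_date", "merchant"])["amount"]
    .sum()
    .reset_index()
)

result = daily_spend.compute()  # triggers the actual parallel computation
result.to_csv("daily_spend_summary.csv", index=False)
```

The computed summary is small enough to hand to any visualization layer, including Power BI.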
R-Based Solutions
R is great for statistical computing and visualization:
data.table – Faster alternative to dplyr, optimized for handling large datasets.
sparklyr – R interface for Apache Spark, enabling big data processing.
ggplot2 – Works well if you reduce the dataset size via aggregation or sampling.
Shiny – Builds interactive dashboards for large datasets.
DT (DataTables) – Optimized for interactive tables with large amounts of data.
Hybrid Approach (SQL + Python/R + BI Tools)
Since cloud is not an option, consider a hybrid setup (sketched after this list) where:
SQL Server (on-prem) manages raw storage and partitioned data.
Python/R handles data transformation and visualization.
BI tools (like Power BI or Tableau) layer on top for dashboards.
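Here is a rough sketch of what the Python layer of that hybrid setup could look like, reading one monthly partition from SQL Server, aggregating it, and writing a compact summary table back for the BI layer to import; the server, database, table, and column names are assumptions:

```python
# Hybrid on-prem pipeline sketch: SQL Server holds the raw partitions,
# Python aggregates, and the BI tool imports only the small summary table.
# Server, database, table, and column names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://my-onprem-sql/Cards"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

# Stream one monthly partition at a time to keep memory bounded.
chunks = pd.read_sql(
    "SELECT CardID, TransactionDate, Amount "
    "FROM dbo.CreditCardTransactions "
    "WHERE TransactionDate >= '2025-01-01' AND TransactionDate < '2025-02-01'",
    engine,
    chunksize=1_000_000,
)

# Aggregate chunk by chunk, then combine the partial results.
partials = [
    chunk.groupby(chunk["TransactionDate"].dt.date)["Amount"].agg(["count", "sum"])
    for chunk in chunks
]
monthly = pd.concat(partials).groupby(level=0).sum().reset_index()
monthly.columns = ["TransactionDay", "TransactionCount", "TotalAmount"]

# Write the compact summary back; Power BI/Tableau imports this table instead
# of the billions of detail rows.
monthly.to_sql("MonthlySpendSummary", engine, if_exists="replace", index=False)
```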