Efficiently identifying rows with invalid dimension combinations

Mark · March 9, 2021, 6:14pm

Hi all,

I have a ‘Data Validity’ requirement to filter a fact table, so that is shows records where we have an ‘Invalid Combination’ of dimensions. This is where the business has provided some rules saying we should not have any records where DIM A has a value of X and DIM B has value of Y.

My problem is trying to find an efficient way to do this as the fact table is relatively large at 5M rows and I would like to avoid adding a new column.

Please see attached mock up where I have I tried to replicate the issue: Invalid dimension combinations.pbix (181.3 KB)

We have two Dim tables: Regions and Customers and a Sales fact. The table below shows the number of Sales, split by Region and Customer Type:

The rows in red are what we could classify as a Invalid Combo, so:

Africa should only have Retailers.
Asia should only have Wholesalers.
North America and Europe can have any Customer Types

I tried creating a simple measure which returns 1 where the invalid combinations are found:

Invalid Combo Check = 
var __Region = MIN(Regions[Region])
var __CustomerType = MIN(Customers[CustomerType])
RETURN
IF(__Region = "AFRICA" && __CustomerType IN {"DISTRIBUTOR","WHOLESALER"} , 1,
    IF(__Region = "ASIA" && __CustomerType <>  "WHOLESALER", 1, 
        BLANK()
    )
)

This works fine in my Matrix:

And I can also add the measure into my Sales Detail table, allowing me to report the problem records to the users:

Whilst this appears to work well on this small mock up dataset, it performs very badly on a 5M row fact table as it has to iterate through each row to check the combinations. Also in reality the rule set is much more complicated.

Can this be achieved using a measure?

Many thanks as always,

Mark

BrianJ · March 9, 2021, 6:26pm

@Mark,

Great explanation of the problem, and thanks for posting the PBIX file upfront.

I have changed the category of this post to “Optimization Lab” as it is an absolutely perfect fit for this new initiative.

Paging Dr. @AntrikshSharma to the Lab…

Brian

ankit · March 15, 2021, 12:52pm

Hi @Mark

It will be better to go for a Calculated Column instead of a Measure. By using Column, your data will be filtered before evaluating the measure and based on my testing it will reduce Visual time by almost half.

Please try creating a column like below

Col_Invalid Combo =
IF (
     (
        Sales_Data[RegionCode] = "AFR"
            && RELATED ( Customers[CustomerType] ) IN { "DISTRIBUTOR", "WHOLESALER" }
    )
        || (
            Sales_Data[RegionCode] = "ASI"
                && RELATED ( Customers[CustomerType] ) <> "WHOLESALER"
        ),
    1,
    2
)

This will provide simple 1 and 2 values and due to Columnar storage won’t occupy much space also. Simply use this column as visual/Page filter to remove unnecessary rows before evaluation.

PS: If using Column is not an option, then I have tried 1-2 different measures but they won’t improve performance.

Thanks
Ankit J

EnterpriseDNA · March 16, 2021, 1:21am

Hi @Mark, did the response provided by @Ankit help you solve your query? If not, how far did you get and what kind of help you need further? If yes, kindly mark as solution the answer that solved your query. Thanks!

Mark · March 16, 2021, 8:56am

Hi @ankit,

Thanks for taking the time to look at the problem. It does look like the calculated column will be the best solution then, I just wanted to check to make sure that I wasn’t missing something - I always try to avoid adding columns where possible!

Thanks again,

Mark