Here’s Brian’s entry for Power BI Challenge 6. @BrianJ, feel free to add other details of your work.
Here’s how Brian described it:
My strategy was to use the challenge to explore some of the AI and machine learning capabilities of Power BI that I hadn’t previously delved deeply into. Here’s how I was planning to structure my entry:
- Fraud detection algorithms built in DAX and R to detect irregularities in the data using Benford’s Law (following a suggestion from @Paul).
- Analysis and visualization of status changes and their timing, using Synoptic Panel to transform a static process flow diagram into a dynamically linked visual that depicts status changes clearly and graphically.
- Complaints broken down by dimensions using Microsoft Research’s SandDance visualization/ML interface.
- Worst-offending brokers analyzed using the Power BI Key Influencers visual.
Here’s how it shook out: #1 was, I think, a big success. #2 was a partial fail – I learned a lot and ultimately got it to work, but it ended up as a confusing visual that didn’t provide significant value or insight. In the right situation I do think it’s a good technique, and you’ll probably see me use it down the road in another challenge. One you will not see me use ever again is #3, SandDance. This section was an epic fail – I hated both the interface and the look of the resulting visuals. #4 worked pretty well, making what would otherwise be quite a complex analysis easily accomplished and accessible even to relatively new users. Given the failures on #2 and #3, I don’t have a complete entry to present, but some may find value in the explanation of #1 and #4, so here goes:
FRAUD DETECTION ALGORITHMS
Given that this challenge is about preparation for an audit, I thought focusing first on the reliability and validity of the data was a good starting point. Forensic accounting frequently uses Benford’s Law to detect irregularities in data that may be attributable to tampering/fraud. Benford’s Law states that:
“In many naturally occurring collections of numbers, the leading digit is likely to be small. For example, in sets that obey the law, the number 1 appears as the leading significant digit about 30% of the time, while 9 appears as the leading significant digit less than 5% of the time. If the digits were distributed uniformly, they would each occur about 11.1% of the time.” (Wikipedia)
Benford’s Law has been shown to apply to series as varied as transaction-level accounting data, census data, home addresses, areas of lakes, stock prices, death rates, home prices, distances between planets, and a wide array of other types of data series.
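To make the expected distribution concrete: under Benford’s Law, the probability that the leading digit is d equals log10(1 + 1/d). Here is a minimal R snippet (illustrative only, not part of Brian’s attached script) that computes those expected proportions:

```r
# Expected first-digit proportions under Benford's Law: P(d) = log10(1 + 1/d)
digits <- 1:9
benford_expected <- log10(1 + 1 / digits)
names(benford_expected) <- digits
round(benford_expected, 3)
#>     1     2     3     4     5     6     7     8     9
#> 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
```

These are the figures quoted above: roughly 30% of leading digits should be 1, and fewer than 5% should be 9.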
For this challenge, I used some virtual-table-heavy DAX measures to calculate the frequency distribution of the first digits of the daily number of complaints, the expected reimbursement amounts, and the population of the cities of the clients associated with each complaint. What I found is summarized in the visuals below, which compare the observed distribution of first digits for each of the chosen fields against the distribution expected under Benford’s Law.
I then tested, using the attached R script, whether the observed distributions conformed statistically to the expected Benford distribution. (Note: this cannot reasonably be done within Power BI, because generating the P values needed to evaluate statistical significance requires simulating 10,000 trials of the test; R can do this in just a few lines of code, almost instantaneously.) In the case of complaints and expected reimbursements, the distributions did conform, while the distribution associated with city population did not. The first two results strongly imply that those data are real, not fabricated. The third strongly indicates that the city population data either is fraudulent or has been randomly assigned to clients in a way that does not represent their actual locations and the other attributes of the regional table. I ground-truthed these findings with @Haroonali1000, and he confirmed that my findings for all three were indeed correct.
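For anyone who wants to replicate the approach, here is a minimal sketch of that kind of conformance test in base R. It is not Brian’s attached script: the data and variable names are placeholders, and the first-digit extraction that Brian did with DAX measures is done here in R so the example is self-contained.

```r
# Expected proportions under Benford's Law: P(d) = log10(1 + 1/d)
benford_expected <- log10(1 + 1 / (1:9))

# Extract the first significant digit of each value (zeros and NAs dropped)
first_digit <- function(x) {
  x <- abs(x[x != 0 & !is.na(x)])
  as.integer(floor(x / 10 ^ floor(log10(x))))
}

# Placeholder vector standing in for, e.g., the expected reimbursement amounts
values   <- c(1823, 245, 1190, 3075, 987, 1420, 160, 2210, 134, 1755)
observed <- table(factor(first_digit(values), levels = 1:9))

# Chi-square goodness-of-fit test with a Monte Carlo p-value simulated over
# 10,000 trials, as described above; a small p-value indicates the observed
# first-digit distribution departs from Benford's Law.
chisq.test(observed, p = benford_expected, simulate.p.value = TRUE, B = 10000)
```

With realistic sample sizes, a p-value well above 0.05 is consistent with the field conforming to Benford’s Law, while a very small p-value flags it for further investigation.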
This is a powerful technique that can be used in many different contexts to evaluate data quality.
IDENTIFYING WORST-PERFORMING BROKERS USING KEY INFLUENCERS VISUAL
There are a lot of different ways one could define “worst performing” per the requirements of the brief. I thought the most direct way was to identify which brokers had the lowest levels of client satisfaction. To do this in a statistically rigorous way, you would typically build a multiple regression model. However, standard regression models won’t work when the outcome variable is categorical – in this case binary (satisfied versus not satisfied) – so the analysis typically calls for a logistic regression model in R or a similar statistical package. Fortunately, Microsoft fairly recently added the Key Influencers visual, which does the heavy lifting of building the logistic regression model for you.
Simply by selecting the Key Influencers visual and dropping the right fields into the appropriate wells, you get a fully interactive logistic regression visual that identifies the worst-performing brokers:
By dropping down the outcome variable slicer, you can also quickly change the visual to show the best-performing brokers.
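For anyone curious what the visual is doing under the hood, here is a minimal sketch of the equivalent by-hand analysis in R using a standard logistic regression. The data frame and column names (satisfied, broker, complaint_type) are hypothetical placeholders rather than the challenge dataset’s actual fields, and this is not claimed to be the exact algorithm Key Influencers runs internally.

```r
# Hypothetical stand-in for the complaints data (placeholder columns and values)
complaints <- data.frame(
  satisfied      = c(1, 0, 1, 0, 0, 1, 1, 0),
  broker         = c("A", "B", "B", "A", "C", "A", "C", "B"),
  complaint_type = c("Delay", "Fees", "Delay", "Fees", "Delay", "Delay", "Fees", "Fees")
)

# Logistic regression: binary outcome (satisfied vs. not) modelled against
# broker and another dimension - analogous to what the visual builds for you.
model <- glm(satisfied ~ broker + complaint_type,
             data = complaints, family = binomial)

# Large negative broker coefficients flag the brokers most associated with
# dissatisfaction, i.e. the worst-performing brokers.
summary(model)
```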
Anyway, I’m disappointed that I don’t have a full entry to submit, but I hope you found some value in what I learned through the experimentation above.
To learn about the real-life scenario presented for the challenge, be sure to click on the image below.