Possible to perform a multi-step Fuzzy Group?

jhu23 · September 17, 2021, 9:28pm

Hi,

Does anyone know if it is possible to perform a 2-step fuzzy group in Power Query/Power BI?

We have a dataset (60K+ rows) that looks similar to the below:

We want to be able to group the customers together based on a combination of name, address, and zip. However there is a lot of variability in the fields so the process would take an extremely long time and the accuracy of the groups would suffer.

Is it possible to first perform a Primary group operation on a field where we can apply a very high similarity threshold as we are confident the data is relatively clean (i.e. Zip Code)?

Once grouped by Zip Codes, we then perform a secondary grouping operation on name/address within each primary group. The secondary groups would have a significantly smaller subset of addresses to group accounts by which will hopefully increase accuracy.

Something like the below?

Any help or insight would be greatly appreciated! Thank you all in advance!

Melissa · September 18, 2021, 10:45am

Hi @jhu23,

Welcome to the Forum.

In future would you be so kind to include the sample file as well? Thanks in advance.
Copy this M script into a new blank query.

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUfLKzwOShkbGCr6JmXkKwSVFqaklQBGf/GIFx7z01JzUYiDP2RFIWBoYGRooxepEKxmBNWag6dQDaXPEVG0M5Dll5uQAKRNTMwWnovzElPLESoglOO0wAfLCgboyE3OhGv3QtOJ2nymQF5KZlpaYV4nkRCDTs6gsMy8VrsHIzNAIrMEMqwaon7BqigUA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name = _t, Address = _t, City = _t, State = _t, Zip = _t]),
    ZipGroup = Table.Group(Source, {"Zip"}, {{"Group", each _, type table [ID=nullable text, Name=nullable text, Address=nullable text, City=nullable text, State=nullable text, Zip=nullable text]}}),
    NameAddressGroup = Table.TransformColumns(ZipGroup, {{"Group", each Table.FuzzyGroup( _, {"Name", "Address"}, {{"Name Address Group", each _, type table [ID=nullable text, Name=nullable text, Address=nullable text, City=nullable text, State=nullable text, Zip=nullable text]}}, [IgnoreCase = true, IgnoreSpace = true, Threshold=0.5] )}} )
in
    NameAddressGroup

With this result

Fuzzy matching is a bit of a black box, so you might have to adjust the threshold level in your production model, and I highly recommend reviewing the documentation as well.

I hope this is helpful

EnterpriseDNA · September 20, 2021, 12:43am

Welcome to the forum @jhu23 ,good to see that you are having progress with your inquiry.

Did the response provided by @Melissa help you solve your query?

If yes, kindly mark as solution the answer that solved your query.

If not, kindly send us the masked pbix file to help you further. Here are some videos that may help you masking sensitive data and create datasets and data models representative of your problem:

jhu23 · September 20, 2021, 4:46pm

That was perfect! Thank you so much for your help with this!