Does anyone know if it is possible to perform a 2-step fuzzy group in Power Query/Power BI?
We have a dataset (60K+ rows) that looks similar to the below:
We want to be able to group the customers together based on a combination of name, address, and zip. However there is a lot of variability in the fields so the process would take an extremely long time and the accuracy of the groups would suffer.
Is it possible to first perform a Primary group operation on a field where we can apply a very high similarity threshold as we are confident the data is relatively clean (i.e. Zip Code)?
Once grouped by Zip Codes, we then perform a secondary grouping operation on name/address within each primary group. The secondary groups would have a significantly smaller subset of addresses to group accounts by which will hopefully increase accuracy.
In future would you be so kind to include the sample file as well? Thanks in advance.
Copy this M script into a new blank query.
let
Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WMlTSUfLKzwOShkbGCr6JmXkKwSVFqaklQBGf/GIFx7z01JzUYiDP2RFIWBoYGRooxepEKxmBNWag6dQDaXPEVG0M5Dll5uQAKRNTMwWnovzElPLESoglOO0wAfLCgboyE3OhGv3QtOJ2nymQF5KZlpaYV4nkRCDTs6gsMy8VrsHIzNAIrMEMqwaon7BqigUA", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [ID = _t, Name = _t, Address = _t, City = _t, State = _t, Zip = _t]),
ZipGroup = Table.Group(Source, {"Zip"}, {{"Group", each _, type table [ID=nullable text, Name=nullable text, Address=nullable text, City=nullable text, State=nullable text, Zip=nullable text]}}),
NameAddressGroup = Table.TransformColumns(ZipGroup, {{"Group", each Table.FuzzyGroup( _, {"Name", "Address"}, {{"Name Address Group", each _, type table [ID=nullable text, Name=nullable text, Address=nullable text, City=nullable text, State=nullable text, Zip=nullable text]}}, [IgnoreCase = true, IgnoreSpace = true, Threshold=0.5] )}} )
in
NameAddressGroup
With this result
Fuzzy matching is a bit of a black box, so you might have to adjust the threshold level in your production model, and I highly recommend reviewing the documentation as well.
Welcome to the forum @jhu23 ,good to see that you are having progress with your inquiry.
Did the response provided by @Melissa help you solve your query?
If yes, kindly mark as solution the answer that solved your query.
If not, kindly send us the masked pbix file to help you further. Here are some videos that may help you masking sensitive data and create datasets and data models representative of your problem: