Near Duplicates - Power Query

OK. This can’t be that difficult. I’m 50% of the way there but not 100% :slight_smile:

I have COVID 19 people that have gone into isolation more than once, but I want to analyze the date difference/s.

So, I was fine with using PQ to group by ID number and get the date for each “duplicate” or even “triplicate” row but how do I compare the dates between rows with the same ID?

So my data looks like this

A date 1
A date 2
B date 1
B date 2
B date 3
C date 1
C date 2

All help would be much appreciated (of course)!
Thanks
Erica

Hi @Ericadyson,

It’s not clear to me what you’re after, especially since the sample provided doesn’t seem to match your description about multiple rows nor does it include actual dates making it hard to work off.

Please provide more details and a mock up with desired results would be greatly appreciated as well.

Hi @Ericadyson, we aim to consistently improve the topics being posted on the forum to help you in getting a strong solution faster. While waiting for a response, here are some tips so you can get the most out of the forum and other Enterprise DNA resources.

  • Use the forum search to discover if your query has been asked before by another member.

  • When posting a topic with formula make sure that it is correctly formatted to preformated text </>.

  • Use the proper category that best describes your topic

  • Provide as much context to a question as possible.

  • Include demo pbix file, images of the entire scenario you are dealing with, screenshot of the data model, details of how you want to visualize a result, and any other supporting links and details.

I also suggest that you check the forum guideline How To Use The Enterprise DNA Support Forum. Not adhering to it may sometimes cause delay in getting an answer.

Hi Melissa

Thanks for reaching out. This is what I want to do. I have people who have been insolation more than once. I want to check the days’ difference for each of them. So when I say duplicates… they aren’t “duplicates” in the strict sense but their ID appears more than once in the list of people who are or who have been in isolation is a given locality.

I have grouped the IDs and identified where they appear more than one. Now say I have 3 rows for that ID and 3 different dates … how do I compare each date with the previous, and then move onto the next person with multiple isolations?

Does that make it a bit clearer? Hope you can help me out.

Thanks so much.

Erica

Hi @Ericadyson,

See if this meets your requirement.

How does it work?

  1. BufferedTable step loads the table into memory
  2. DayCount step
  • This generates a table called GetData that contains only rows where the Person IDs match and where the Date is smaller that the Date on the current row.
  • returns null if GetData is empty else
  • calculates the difference in days between the max date in GetData and the current row

Note that if both dates should be included in the day count you’ll need to add 1 to the calculation.
.

Here’s the sample file. eDNA - Day count between isolations.pbix (15.9 KB)
I hope this is helpful.

1 Like

calculate days difference in dates for duplicate ids ed version.xlsx (85.5 KB)

Hi Melissa

Wow. Thanks very much. That’s so neat. I shall need to study these functions which of course are new to me.

I actually managed to get the answer but by a much longer route (although it’s fast to execute). See attached file. I’ll explain it because it might help other lesser mortals, like me!

Query 1 - query to group by ID, and retain the date (using AllRows) , retain only rows where group count of ID is >1. and add an index starting with 0.

Query 2 to end - duplicate query 1 but change the index to start at 1. Then merge (left join), based on id and index, then remove nulls from dates. This gives you rows with 2 dates (which is what we need), then a custom column to calculate the difference in days.

I’ve attached the file in case anyone else might want to copy it. It’s far longer than yours, of course,).

So mega thanks… I’m off now to learn about table buffer and the way you’ve constructed the query!
Erica

:+1:

Let me know if you have any questions.

Hi Melissa

Thanks for getting back to me. I did go through your solution… great but I had difficulty adding the first date as a second column… What did you mean by " Note that if both dates should be included in the day count you’ll need to add 1 to the calculation.". Where do I add +1 and why does it work?

Hi @Ericadyson,

I’ve rebuilt the logic in your Excel file and you aren’t including both dates so the +1 is irrelevant but if you examine the first screenshot in post #5 you’ll see the place where that should be added when needed.

Bringing back the Previous date actually isn’t that hard because you have all the elements needed to calculate it, it’s just subtracting the [DateDiff] from [Date] and you can use Date.AddDays for that.

Here’s the full M code for the query including the Previous date:

let
    Source = Excel.CurrentWorkbook(){[Name="testsource"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"מספר זהות", type text}, {"ת. סיום בידוד", type date}}),
    RenameColumns = Table.RenameColumns(#"Changed Type",{{"מספר זהות", "ID"}, {"ת. סיום בידוד", "Date"}}),
    BufferedTable = Table.Buffer( RenameColumns),
    KeepOnlyMultipleIDs = Table.SelectRows(RenameColumns, each Table.RowCount( Table.SelectRows(BufferedTable, (BT) => BT[ID] = _[ID] )) >1 ),
    DateDiff = Table.AddColumn(KeepOnlyMultipleIDs, "DateDiff", each 
        let GetData = Table.SelectRows(BufferedTable, (BT) => BT[ID] = [ID] and BT[Date] < [Date] ) in
        if Table.IsEmpty( GetData ) then null else Duration.Days( Duration.From( [Date] - List.Max( GetData[Date]))), Int64.Type ),
    ExcludeBlanks = Table.SelectRows(DateDiff, each ([DateDiff] <> null)),
    AddPreviousDate = Table.AddColumn(ExcludeBlanks, "Previous Date", each Date.AddDays( [Date], -[DateDiff] ), type date)
in
    AddPreviousDate

.
Here’s your sample file: eDNA - calculate days difference in dates for duplicate ids.xlsx (92.2 KB)
I hope this is helpful.

eDNA - calculate days difference in dates for duplicate ids2.xlsx (98.7 KB)

Hi Melissa… you’re a real power house without endless patience… incredible and thank you. Even the hieroglyphics in Hebrew didn’t phase you!!

I did another version using “previous row” methodology which is much closer to the functions you’ve generated. What do you think? I’ve added another sheet “Previous row version”. It’s a little easier for me to get grips with but with far less steps than my first attempt (but still more than your slick version). I need to get used to creating functions … so that’s another step on my learning ladder.

But if you have the time I’d be grateful if you could look at my latest attempt (previous row version) and see if I can make it even better (but without functions, which as I said, I haven’t really got there yet).

By no means, urgent… because I now have 3 ways to get to the same answer. 2 are more long winded that yours , of course. Mega thanks. Erica

Hi @Ericadyson,

A couple of techniques showcased in this video could be helpful in your scenario as well, you may want to review that if you have the time.

As for what solution to go for…
Let’s start by stating that it’s not a competition to write complex- or as little M as possible. Choosing a solution you understand which performance wise is acceptable is very important. I fully agree with you there.

Kudos to you :+1: for exploring other solutions to your own question that is so important for increasing your understanding of M - there is no better way than just diving in and explore…

Ooops. I meant to write "you’re a real powerhouse WITH endless patience! So so sorry… for writing the wrong word.

1 Like