Data Cleaning - Power Query - Removing Duplicates

Hi everyone, so I have data source that comes from an FTP folder with .csv extension file. It updates everyday with new data. BUT, there were data that has been duplicated coming from previous day that is shown on the current day.

Say today is Monday, some data from previous day (Sunday) is also included on the update. How can I remove the duplicated data and ensure that it will not be duplicated in the future?

Hi @ronald_balza,

If possible, find our why your FTP folder contains files with duplicated data. Contact the party that creates those csv files and resolve it at the source.

When that’s not possible, formulate a method to identify these duplicated records and how you want to filter them out in Power Query. For example maybe you can Group By a set of unique identifiers and omit the duplicates by oldest- or newest date.

I hope this is helpful

1 Like

I’ve had the same in some of my tables, and I’m using the Table.Buffer formula.
Adding this allows you to control which duplicates are to be removed via sorting the table in the step before. The sorting remains applied to your table at the moment of removing the duplicates. Otherwise the removal will happen somewhat randomly (at least in a way that I don’t follow).

  1. Expand the data from the CSV contents. Make sure that you also include a column with the date of the file itself.
  2. Sort the table data first on this File Date column with newest date on top (this way you’ll always maintain the most recently added data from the FTP files).
  3. Additionally, sort the table data as you see fit (Sales date oldest to newest for instance)
  4. Insert a new step and change the code to this: = Table.Buffer(#“Sorted Rows”)
  5. After this, remove the duplicates from the table (without including the columns that cause the duplicates like the file date).

Here’s a snippet of the code:

    #"Sorted Rows" = Table.Sort(#"Changed Type1",{{"Effective Date", Order.Descending}}),
    #"Table Buffer" = Table.Buffer(#"Sorted Rows"),
    #"Removed Duplicates" = Table.Distinct(#"Table Buffer", {"AAA", "BBB", "CCC", "DDD"})
in
    #"Removed Duplicates"
2 Likes