ETL Query with multiple PDF'S

Hey DNA,

Scenario:
• Client A has X amount of customers.
• Every month X customers send a PDF statement to Client A (the format of the statement varies from customer to customer and is not universal).
• Although each PDF statement differs in design, it contains similar data which is to used to populate 4 tables within Power BI on a monthly basis (e.g Sales, Products, Customer, Costs)
• Each month the data including historical and new statements will be sourced into these Power BI tables and then the data analysis can begin.

Suggested Solution:
• Whilst the data probably should be loaded into a DB and sourced from there, I was flirting with the idea of creating a folder for each customer and using power Query to bulk upload on a monthly basis and ETL.

Questions:
Does anyone -
• see any issues with this?
• think it is possible?
• have a better idea?
• and have a suggestion on how this could be achieved?

Thanks Guys

I think the question would be how you would deal with the brittleness of depending on the file formats not changing.
Loading the data into a db would probably give you a more stable configuration to load into Power BI.
The real use for having the data in the DB would be to move “upstream” and have the data entered into the DB and then have the DB produce the pdf reports, and serve a source for loading data into power BI.

Agreed re database- Did I mention i was being lazy / efficient :wink: Potentially as a one off exercise for all past statements it still has value.
Also agreed that it needs to rely on customer 1 maintaining the same format on a monthly basis which then differs from customer n+1 so lets assume this assumption

Also please work on assumption that customer is informing client of sales and costs. Seems a strange scenario but have seen this across a few sectors now

Hey

This is very much possible, it’s a matter of how much effort you want to put into it. i.e. for each customer, you’ll need to have different queries as the pdf layout is different - meaning multiple queries to maintain.

The below video is by Gil Raviv who is bit of a power query legend. At some point in the presentation he goes through the process of pulling Covid-19 data from publicly available WHO pdfs. I think it will help give you some ideas on how this could work for you.

Hi @Hitman , did the response provided by @samaguire elp you solve your query? If not, how far did you get and what kind of help you need further? If yes, kindly mark the thread as solved. Thanks!