ETL from Excel to Useable Data, Extract, Transformation, Load Clean Data

Hello,

I’m wondering what the most efficient method of extracting data from an excel file that doesn’t have nice clean data to work with. It is more like a document that has labels and fields spread out across the page in horizontal and vertical groups. I have close to about 500 of these to compile to make a useable dataset for analysis. I tried going through and doing some work in power query, but I feel there has to be a better why to get this data.

Assumptions:
Every single document is exactly the the same.
Some revenue and cost breakdown is missing. In those situations I might need to extract the totals as well as the breakdown.

Extraction:
I’m looking to extract everything highlighted orange.
I’m might have to include extracting the totals (not highlighted) because some N/As make up the revenue and cost breakdown as mentioned in the assumptions above.

Transforming:
I did my best to try and use the Power Query to extract the data I needed, however, I think there has to be a better method. I considered Alteryx, but I don’t have an active license, and would prefer to stay within Power BI. I don’t know if M, R, Python or Power Automate, or some other language or tool would make this more efficient. I started with one excel file, and broke it into 4 tables, then created compiled tables. At that rate, I’ll have over 2500 tables, and feel like it is going to be very hard to apply the same transformation to each file, and manage all of them.

At the current stage I have been able to load files separately, clean them, and then append them. I then duplicated the files and created lookup tables.

When I tried to bring in all the folders at once, and I transpose it. I’m left with a huge amount of rows, and the headers change to prevent duplicate names, eg. Customer, Customer 1, Customer 2.

I’m stuck trying to get from the view in Data Model 1 to Data Model 2

Loading:
Ultimately all this data needs to be compiled into a dataset. I imagine like most data models, I would have some lookup tables, like customer and item, and a fact table which has the revenue, the costs, the margins, etc.

Cleaning: I would change N/As to blanks and nulls to blanks.

Any thoughts on the best method for ETL or how to append the transposed data? I appreciate the insight.

Sales Data 1.xlsx (16 KB)
Sales Data 2.xlsx (15.9 KB)
Sales Data 3.xlsx (16.0 KB)
Sales Data 4.xlsx (16.0 KB)
Sales Data 5.xlsx (16.0 KB)

Sample ETL.pbix (276.8 KB)

Bumping this post for more visibility.

I found a solution to my post.

This requires the creating and invoking custom functions.

It anyone gets stuck on the same issue, here is a link that should explain everything.

The only difference was the example from MS docs shows a csv, whereas I had xlsx files there was a small variance in exact steps.

1 Like