Transforming a Messy Dataset (PDF source)

Hello,

I am an inexperienced PQ user (GUI + lightweight formula tweaking). I have come across a dataset which looks a bit erratic. The story behind it: a document arrives in PDF format every so often and currently has to be processed manually. Then the next one comes (I am pretty sure the base format is the same) and needs to be appended to the former.

I have ground away at it for a while without much success. My goal is to automate the transformations to the extent that the next PDF gets appended without (hopefully) breaking anything, and then to enhance the data further and analyse it in PBI.

The specific problems I cannot deal with at my current level are:

  1. Records “belonging” to one column are scattered across different columns;
  2. Records contain a mix of relevant and irrelevant information;
  3. Subsequent rows of the same column contain information which needs to be merged;
  4. A record is compressed into pieces of information belonging to different columns;
  5. Irrelevant information appears to be interwoven without a discernible pattern.

I am attaching an .xlsx containing an example of the complete dataset (a sheet named MessyDataset); it originates from a PDF file that was imported into PQ, combined from the underlying tables and then copy-pasted into the Excel file.

The same file contains a sheet named DesiredOutcome, which shows what I would specifically like to achieve, with some references to the records or problems I find particularly puzzling.

Thanks in advance for tips and suggestions as to which techniques and tricks might work here. I would be particularly happy to know your thought process behind tackling transformations of such a dataset.

All the best

Rafal

Messy Dataset.xlsx (91.5 KB)

Hi Rafal,
I downloaded your file and spent about an hour with it, but I couldn't get it to work as you expected. Is there no way to clean the file before loading it?

Why don't you try going back to your IT department to get the data extract that populates that PDF file?

Also, could you provide the PDF file? It is likely easier to work with the actual PDF file than with the MessyDataset sheet in Excel.
Thanks,
Keith


Hi guys,

Thanks for your remarks,

@cmalone1112 - this is an external file; I have no influence on the form it comes in.
@keith - this .xlsx looks exactly like the .pdf after loading it into PQ and combining the underlying tables, so I don't really see the point - am I missing something? The IT folks looked into it and more or less gave up on achieving the final result.

Best

Rafal

How do you download the data from the external source? Could you import individual columns separately from the external source and combine the columns in the Power Query Editor?

Sometimes it's easier with a PDF file instead of Excel. Did you check on YouTube?

Thanks for your time guys,
@deltaselect - the file is stored locally and imports in the standard PDF way, so the structure in the Navigator is N tables + M pages. I guess the answer is no.
@keith - indeed, but I have not uncovered a clue as to whether one format or the other is easier or harder to transform. Have you got any specific video in mind?

Hi Rafal,

I did some puzzling in Excel (I am also not very experienced in Power Query).
It seems that the rows where column B contains an order number (??) bigger than 3,000,000,000 hold your data.

A) I added a column in Excel:

  1. If the length = 10 and the number > 3 billion, take the number;
  2. If the length is greater than 10 but the last 10 digits form a number > 3 billion, take that number;
  3. Otherwise, mark the row as not needed.
    Excel formula: =IF(AND(LEN(B5)=10,B5*1>3000000000),B5,IF(RIGHT(B5,10)*1>3000000000,RIGHT(B5,10)*1,IF(ISERROR(B5),"Text","not needed")))
    For some lines that are not required this formula still gives a #VALUE! error (error values are not accepted in Power Query and cause errors there), but at least it gives the order numbers starting with a 3.

B) Then you can concatenate the text from the other columns in the row with the order numbers, using the formula =CONCATENATE(B2,",",B3,",",B4) (text separated by commas), and load that into Power BI.
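If you would rather keep the whole thing inside Power Query instead of using helper columns in Excel, a rough M equivalent of steps A and B could look like the sketch below. It is only a sketch under assumptions: the raw rows are assumed to sit in a query called MessyDataset, and the order-number and text columns are assumed to be named Column2-Column4 (Power Query's default names for Excel columns B-D); adjust the names to whatever your actual query uses.

    // Rough M sketch of steps A and B (MessyDataset and Column2-Column4 are placeholders)
    let
        Source = MessyDataset,
        // Step A: pull out a 10-digit order number greater than 3,000,000,000, else null
        AddOrderNr = Table.AddColumn(
            Source,
            "OrderNr",
            each
                let
                    txt    = if [Column2] = null then "" else Text.From([Column2]),
                    last10 = if Text.Length(txt) > 10 then Text.End(txt, 10) else txt,
                    nr     = try Number.From(last10) otherwise null
                in
                    if nr <> null and nr > 3000000000 then nr else null,
            type nullable number
        ),
        // Step B: join the row's text pieces with commas (nulls are skipped by Text.Combine)
        AddCombined = Table.AddColumn(
            AddOrderNr,
            "Combined",
            each Text.Combine(List.Transform({[Column2], [Column3], [Column4]}, Text.From), ", "),
            type text
        )
    in
        AddCombined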

Far from ideal - the delivered data output should be improved. (In my own experience, some external organisations were willing to deliver a query instead of PDF files with headers.)

Check out the April 2021 version of Power BI - there is a new Text/CSV by example feature that might get you what you need.

Nothing specific, but there is a video on using R within Power BI, and other videos that can help. Having the original PDF file is better - it's just easier working with the PDF file, that's all.

@deltaselect
I thought about using a function to separate numbers from text - that might do the trick for column B, but only partially. I am also considering dividing the whole set of tables into subsets like "easy transformations" and "edge cases", applying different steps and then appending them back together. Yet it wouldn't be a compelling solution.
@cmalone1112
Thanks, I will check this out.
@Keith
I have seen one where an R script is used to create text from a PDF (it originated from the biccountant blog), but I am not sure whether that is helpful here; the rest seem to show import techniques only.

Sometimes you just have to play around to find the best solution. Sometimes you have to take components of several videos or processes to get to the final result. It's just a matter of testing different routes.

Hi Rafal,

I double-checked your messy data.
It seems to clean up much better with the following steps, done in Excel:

  1. Add an index / sequence number in column A so you can later restore the original order of the data, and replace the needless null values.
  2. Make a unique list of the values in column 3 (copy column 3 to a new sheet and remove duplicates, or make a pivot table from it). It seems the valuable information is where the data in this column starts with PH, WC, AL or VL - you could double-check this. Give those a 1 in a separate column of the pivot/lookup table, and the non-valuable ones a 0.
  3. Add an extra column to your data with VLOOKUP so you can sort the valuable information (marked with a 1) to the top - that gives around 317 rows - and delete all the other lines.
  4. Sort on the date in column D so that all non-dates (27 lines) end up last, then move the data of those +/- 27 last rows from column E (which holds a date) through column N one column to the left, so it lands in columns D to M.
    I did this in Excel, as I did not find the advanced filtering possibility in the Power BI Query Editor, and I also do not know how to move part of the row data to a different column in the Query Editor (probably missing knowledge); in Excel this is easy. (A rough Power Query sketch of steps 1-3 follows below.)
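For what it is worth, steps 1-3 can also be done directly in the Query Editor. The following is a minimal sketch under assumptions: the raw data is in a query called MessyDataset and the code values sit in a column called Column3 - both placeholder names. The column shift of step 4 is harder to express generically; one option would be a conditional column per target column that takes the value from the column to its right whenever the date test fails.

    // Sketch of steps 1-3 in M (MessyDataset and Column3 are placeholder names)
    let
        Source = MessyDataset,
        // Step 1: keep the original row order so it can be restored after sorting
        AddIndex = Table.AddIndexColumn(Source, "RowIndex", 1, 1),
        // Steps 2-3: keep only rows whose code starts with one of the "valuable" prefixes
        ValuablePrefixes = {"PH", "WC", "AL", "VL"},
        KeepValuable = Table.SelectRows(
            AddIndex,
            each
                let code = if [Column3] = null then "" else Text.From([Column3])
                in  List.AnyTrue(List.Transform(ValuablePrefixes, (p) => Text.StartsWith(code, p)))
        )
    in
        KeepValuable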

As your messy data does not have headers, I do not know which data is important to you; maybe you only need a few of the columns.

I hope those simple clean-up steps help you further; obviously you can audit the remaining data well after the clean-up steps in Excel.
Kind regards,

Hi Jan,
Thanks for your suggestions.

That is exactly the point I am struggling with - information in some rows needs to be moved to different columns, as also described in my initial post.

Unfortunately that is only partially true. Upon inspection, the data in this column includes: values starting with a two-letter code like PH, WC etc. (which belong in a separate column), but also, for example, "Kontrakt : XXXXXXX" which should stay in a different column (most of the time this information is found in the next column to the right), as well as "0,XXXXX EUR" which is also relevant, and some irrelevant stuff.

I have attached the desired result data in my initial post.

As to the index - that is certainly a viable point. I can’t determine the general sequence of steps as yet; my intuition is that stabilizing the proper values in proper columns is the first thing to do here (then cleaning irrelevant stuff).

Once again thanks for time spent on this and all the best

@Rafal_J

Here is a video of transforming your first table… let me know if it helps pls…


Hi David,

Thanks for the vid!

It does the trick on this particular subset of rows; however, I fear that trying to generalise that approach might lose data.

I am starting to believe that the way to go may be to create a number of subsets with separate transformations and then append the resulting queries.

All the best

Just a thought again... upload the actual PDF file and see if someone on the forum has experience using actual PDF files instead of the messy data in an Excel sheet.

You never know who might be able to use the actual PDF file... it's your call.

Just a quick addition - the PDF source itself.

All the best

Inv 1590274608.PDF (970.2 KB)

Hi @Rafal_J,

I spent a lot of time on the PDF with less than perfect results. Hopefully it is still better than what you had so far.

Tables: The relevant tables have 8-14 columns. There are some tables with an extra column caused by "ść licznika" which you could take out, but you would still have 8-13 columns, and these do not follow a single pattern per column count [i.e. same number of columns but different structure]. So it would be rather difficult to use the tables. [Actually it might be rather quick to load all relevant tables individually to Excel and then manually add the missing columns. It's just that there are 72 relevant tables.]

Pages: Pages 2-25 are relatively regular. You can ignore the last page, but the first page has 3 relevant tables, so you have to treat it individually and then append it to pages 2-25. That works, but it adds additional steps as these two blocks have different structures. :frowning:
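For reference, the page-based route could be wired up roughly like the sketch below. It assumes Power Query's PDF connector (Pdf.Tables), whose navigation table has a Kind column distinguishing pages from tables; the file path and step names are placeholders. Page 1's three relevant tables would then be shaped to the same layout and appended to the result with Table.Combine.

    // Sketch: combine pages 2-25 of the invoice PDF (the path is a placeholder)
    let
        Source = Pdf.Tables(File.Contents("C:\Invoices\Inv 1590274608.PDF")),
        // keep only the page-level items, in document order
        PagesOnly = Table.SelectRows(Source, each [Kind] = "Page"),
        // drop page 1 (handled separately) and the irrelevant last page
        MiddlePages = Table.RemoveLastN(Table.Skip(PagesOnly, 1), 1),
        // stack the remaining pages into one table (missing columns are filled with null)
        Combined = Table.Combine(MiddlePages[Data])
    in
        Combined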

It is not perfect, so what do you think about this approach:

  1. Do the bulk of the work from the PDFs automatically
  2. Do some corrections manually in Excel for each individual PDF/invoice
  3. Combine the corrected Excel tables again with Power Query
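A rough sketch of step 3, assuming the manually corrected invoices are saved as Excel workbooks in one folder and each workbook contains a table named Corrected (the folder path and the table name are placeholders, not something from your file):

    // Sketch: re-combine the corrected Excel workbooks from a folder
    let
        Source = Folder.Files("C:\Invoices\Corrected"),
        ExcelOnly = Table.SelectRows(Source, each [Extension] = ".xlsx"),
        // pull the "Corrected" table out of each workbook
        AddData = Table.AddColumn(
            ExcelOnly,
            "CorrectedData",
            each Excel.Workbook([Content], true){[Item = "Corrected", Kind = "Table"]}[Data]
        ),
        // stack all corrected invoices into one table
        Combined = Table.Combine(AddData[CorrectedData])
    in
        Combined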

I think that is not difficult for you as you understand the content. E.g. I am pretty sure that some exception cases like "ść licznika" are not relevant. That means not all lines that are off need to be corrected; they can probably also be filtered out.

Check out the result and see whether manual correction is OK:
Inv 1590274608.xlsx (85.1 KB)

Hello @Matthias,

Thank you for spending some time on this.

What I have learnt so far is that the preferred way seems to be working with Pages instead of Tables - the reason is structural; there seems to be a bit less messy content.

I agree that it might be easier to decompose it into an "easy" part and a "hard" part.

I even mentioned that earlier.

I cannot figure out the appropriate sequence of steps yet, meaning whether to deal with the offset first, or maybe with the filtering. There is a post which I find quite inspiring:

It seems that the kind of problem is similar, yet the logic of the solution provided by @Melissa is pretty advanced. I am the first to admit that it looks elegant and efficient; I will try to use it here - I need to study it further though.

Your file is mostly on target (and sometimes off); the guidance as to what's relevant and the desired structure is attached to my first post.

Once again thanks for your input :slight_smile: