# Transforming a Messy Dataset (PDF source)

Hi Rafal,

I did some puzzling in Excel (I am also not very experienced in Power Query).
It seems you are looking for the rows whose column B contains an order number (??) greater than 3,000,000,000.

A) I added a column in Excel:

1. If the length is 10 and the number is greater than 3 billion, take the number.
2. If the length is greater than 10, but the last 10 digits form a number greater than 3 billion, take that number.
3. Else: not needed.
Excel formula: =IF(AND(LEN(B5)=10,B5*1>3000000000),B5,IF(RIGHT(B5,10)*1>3000000000,RIGHT(B5,10)*1,IF(ISERROR(B5),"Text","not needed")))
For some lines that are not required, this formula still gives a #VALUE! error. (Such error values are not accepted in Power Query and cause errors there.) But at least it gives the order numbers starting with a 3.
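The three rules of that formula can also be sketched outside Excel. Here is an illustrative Python version of the same logic; the threshold and the 10-digit length come from the formula above, while the function name and the `None` return for "not needed" values are my own choices:

```python
def extract_order_number(value):
    """Mirror the Excel formula's three rules:
    1) a 10-digit number greater than 3,000,000,000 is taken as-is;
    2) for longer strings, the last 10 characters are tried as a number;
    3) anything else is "not needed" (None here)."""
    s = str(value).strip()
    if len(s) == 10 and s.isdigit() and int(s) > 3_000_000_000:
        return int(s)
    if len(s) > 10 and s[-10:].isdigit() and int(s[-10:]) > 3_000_000_000:
        return int(s[-10:])
    return None  # the Excel version returns "Text" / "not needed"
```

Unlike the Excel formula, this version never produces an error value for non-numeric input, which sidesteps the #VALUE! issue mentioned above.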

B) Then you can concatenate the text from the other columns in the row with the order numbers, using the formula =CONCATENATE(B2,",",B3,",",B4) (text separated by commas), and load that into Power BI.

Far from ideal; the delivered data output should really be improved at the source. (In my own experience, some external organizations were willing to deliver a query instead of PDF files with headers.)

Check out the April 2021 version of Power BI: there is a new text/CSV-by-example feature that might get you what you need.

Nothing specific, but there is a video on using R within Power BI, and other videos that can help. Having the original PDF file is better; it's just easier working from the PDF file, that's all.

@deltaselect
I thought about using a function that separates numbers from text; that might do the trick for column B, but only partially. I am also considering dividing the whole set of tables into subsets such as "easy transformations" and "edge cases", applying different steps to each, and then appending them together. Yet it wouldn't be a compelling solution.
@cmalone1112
Thanks, I will check this out.
@Keith
I have seen one with an R script used to extract text from a PDF, originating from the biccountant blog. I am not sure whether that is helpful here; the other material seems to show only import techniques.

Sometimes you just have to play around to find the best solution, and sometimes you have to combine components of several videos or processes to get to the final result. It's just a matter of testing different routes.

Hi Rafal,

I double-checked your messy data.
It seems much cleaner after the following steps, done in Excel:

1. Add an index/sequence number in column A so you can later restore the original order of the data; replace the needless null values.
2. Make a unique list of the values in column 3 (copy column 3 to a new sheet and remove duplicates, or make a pivot table from it). It seems the valuable information is where the data in this column starts with PH, WC, AL, or VL; you could double-check this. Give those a 1 in a separate column of the pivot/lookup table, and the non-valuable ones a 0.
3. Add an extra column to your data with VLOOKUP, so you can sort the valuable information (marked with a 1) to the top; this gives around 317 rows. Delete all other lines.
4. Sort on the date in column D, so all non-dates (27 lines) are listed last. Then move the data of those +/- 27 last rows from column E (which has a date) through column N one column to the left, so it ends up in columns D to M.
I did this in Excel, as I did not find an advanced filtering option in the Power BI query editor, and I also do not know how to move part of the row data to a different column in the query editor (probably missing knowledge); in Excel this is easy.
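The step of moving the misaligned rows one column to the left can be done mechanically. Here is an illustrative Python sketch, under the assumption that each row is a list of cells and that columns D through N correspond to 0-based indexes 3 through 13 (the function name and the `None` fill value are my own):

```python
def shift_left(row, start=3, end=13):
    """Move the cells in positions start..end one position to the left,
    overwriting index `start` and leaving None in the vacated last cell.
    With 0-based indexes, start=3 and end=13 correspond to columns D..N."""
    row = list(row)
    row[start:end] = row[start + 1:end + 1]
    row[end] = None
    return row
```

Applied only to the +/- 27 misaligned rows identified by the date sort, this reproduces the manual Excel move described above.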

As your messy data does not have headers, I do not know which data is important to you; maybe you need only a few of the columns.

I hope these simple cleanup steps help you further; obviously you can audit the remaining data properly after the cleanup steps in Excel.
Kind regards,

Hi Jan,

That is the point I am struggling with: information in some rows needs to be moved to different columns, which was also part of my initial post.

Unfortunately that is only partially true. Upon inspection, the data in this column includes: information starting with codes like PH, WC, et al. (which belongs in a separate column), but also, for example, "Kontrakt : XXXXXXX", which should stay in a different column; most of the time the mentioned info is detected in the next column to the right. There is also "0,XXXXX EUR", which is also relevant, and some irrelevant content.
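To make this concrete, the kinds of values mixed into that column could be told apart with simple pattern checks. This is only a rough Python sketch: the prefixes and value shapes come from the description above, while the category names and the exact regular expressions are my own illustration:

```python
import re

def classify_cell(text):
    """Rough classification of the mixed values described above:
    PH/WC/AL/VL codes, 'Kontrakt : ...' labels, EUR rates, or noise."""
    s = text.strip()
    if re.match(r"^(PH|WC|AL|VL)", s):
        return "code"
    if s.startswith("Kontrakt"):
        return "contract"
    if re.search(r"\d+,\d+\s*EUR$", s):
        return "eur_rate"
    return "other"
```

A classification like this could drive a conditional column, routing each value to the column where it belongs instead of filtering on prefixes alone.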

I have attached the desired result data in my initial post.

As to the index: that is certainly a viable point. I can't determine the general sequence of steps yet; my intuition is that stabilizing the proper values in the proper columns is the first thing to do here (and then cleaning out the irrelevant content).

Once again, thanks for the time spent on this, and all the best.

@Rafal_J

Here is a video of transforming your first table… let me know if it helps pls…


Hi David,

Thanks for the vid !

It does the trick on this particular subset of rows; however, I fear that trying to generalize that approach might lose data.

I am starting to believe that the way to go may be creating a number of subsets with separate transformations and then appending the resulting queries.

All the best

Just a thought again: load the actual PDF file and see if someone in the forum has experience using the actual PDF files instead of the messy data in an Excel sheet.

You never know who might be able to use the actual PDF file… it's your call.

Just a quick addition: the PDF source itself.

All the best

Inv 1590274608.PDF (970.2 KB)

Hi @Rafal_J,

I spent a lot of time on the PDF, with less than perfect results. Hopefully it is still better than what you had so far.

Tables: The relevant tables have 8-14 columns. Some tables have an extra column caused by "ść licznika", which you could take out, but you would still have 8-13 columns, and these do not follow a single pattern per column count (i.e., the same number of columns but a different structure). So it would be rather difficult to use the tables. [Actually, it might be rather quick to load all relevant tables individually into Excel and then manually add the missing columns; it's just that there are 72 relevant tables.]

Pages: Pages 2-25 are relatively regular. You can ignore the last page, but the first page has 3 relevant tables, so you have to treat it individually and then append it to pages 2-25. That works, but it adds additional steps, as these two blocks have different structures. My suggested workflow is therefore:

1. Do the bulk of the work from the PDFs automatically
2. Do some corrections manually in Excel for each individual PDF/invoice
3. Combine the corrected Excel tables again with Power Query

I think that is not difficult for you, as you understand the content. E.g., I am pretty sure that some exception causes like "ść licznika" are not relevant, which means that not all lines that are off need to be corrected; they can probably also simply be filtered out.

Check out the result and see if a manual correction is OK:
Inv 1590274608.xlsx (85.1 KB)

Hello @Matthias,

Thank you for spending some time on this.

What I have learnt so far is that the preferred way seems to be working with Pages instead of Tables; the reason is structural, as there seems to be a bit less messy content.

I agree that it might be easier to decompose it into an "easy" and a "hard" part.

I even mentioned that earlier.

I cannot figure out the appropriate sequence of steps, i.e., whether to deal with the offset first or with the filtering. There is a post which I find quite inspiring:

It seems that the kind of problem is similar, yet the logic of the solution provided by @Melissa is pretty advanced. I am the first to admit that it looks elegant and efficient; I will try to use it on this one, but I need to study it further.

Your file is on (mostly) and off (sometimes); the guidance as to what's relevant, and the desired structure, is attached in my first post.

Once again thanks for your input

@Rafal_J, I wrote that instead of tables I went for pages. I treated page 1 differently than pages 2-25, but pages 2-25 also have different structures, with 15-18 columns including structure-breaking exceptional columns all over the place. Therefore I assumed that manual corrections might be an option for these exceptions. But I am glad that you found an approach which doesn't need that.

Hi @Rafal_J, did the responses provided by the users and experts help you solve your query? If not, how far did you get, and what kind of help do you need further? If yes, kindly mark as solution the answer that solved your query. Thanks!

Hi @Rafal_J, we’ve noticed that no response has been received from you since the 25th of April. We just want to check if you still need further help with this post? In case there won’t be any activity on it in the next few days, we’ll be tagging this post as Solved.

Hi all,

Here is the summary of my efforts so far (in attachment). I hope that my conclusions will be helpful to some participants dealing with PDF data sources.

MessyDatasetInitialClean.pbix (75.0 KB)

1. Source

I decided to deal with Pages instead of Tables. The rationale is transforming the whole set and generalizing the steps, instead of picking individual tables. This is a good approach for sources such as invoices, statements, and other repetitive documents; it is probably not optimal for documents like reports, with lots of text and only a few tables.

2. GetData step

I filtered the Pages in and observed that the last page (No. 26) contains no valuable information, so I used Table.RemoveLastN around the filter to get rid of it. This is useful, but probably dependent on the particular set.

3. PositionalIndex step

I added an index column which indexes every page separately; my idea was to analyze the data and try to discover patterns that might ease separating useful info from the rest (in terms of getting rid of rows). Next, I removed the Data column, expanded the TableWithIndex, and moved the index to the beginning for better readability. This took me three steps; probably someone more skilled can compress it into one?
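The per-page index described here (a counter that restarts on every page) can be sketched compactly. This is an illustrative Python version of the idea, assuming each row carries its page value as the first element; the function name and row shape are my own:

```python
from itertools import groupby

def add_positional_index(rows, page_of=lambda r: r[0]):
    """Number the rows 0, 1, 2, ... separately within each page,
    restarting the counter whenever the page value changes
    (rows are assumed to be grouped by page already)."""
    indexed = []
    for _, group in groupby(rows, key=page_of):
        for i, row in enumerate(group):
            indexed.append((i, row))
    return indexed
```

The resulting positional index is exactly what makes the later "rows 0 to 8 within each page" filter possible.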

4. Removing the First Rows

First I thought about how to remove empty rows, but having an index (which is obviously not empty), I decided against it at this point. Instead I dealt with the first rows: it seemed that 23 of them contain some general information, the source being an invoice, so I got rid of them. This is probably a useful step in every document of this type; I think it might even be a constant pattern throughout every appended document, if it is a series of them.

5. Second Positional Index

I noticed a pattern: rows numbered 0 to 8 within each page are not really needed. To filter them out, I created a conditional column, based on the index column value, dividing the set into positions less than or equal to 8 and the others. I also noticed that the last two positions on every page are irrelevant; yet since this is a dynamic number, I couldn't figure out how to filter them out based on that pattern.

6. Filtering Unwanted Content in Rows

I created an advanced filter: the first condition deals with the 0-8 rows mentioned above; the next six conditions deal with the last two rows on every page (six because there is a set of two unwanted values spread over three columns); the last two deal with values which look to have spilt over from other columns. On the plus side, it looks like there are no null rows after that. It is a bit bulky, though; I would love someone to suggest a more elegant way to do it.
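One way to make such a filter less bulky is to collapse it into a single predicate: a positional cutoff plus a blacklist of unwanted values checked against every cell. This is an illustrative Python sketch of the idea only; the placeholder values in `UNWANTED` are invented, not the actual ones from the dataset:

```python
# Placeholder blacklist; the real unwanted values would go here.
UNWANTED = {"Suma", "Strona"}

def keep_row(pos_index, cells, unwanted=UNWANTED):
    """Keep a row only if its within-page position is past the header
    block (rows 0-8) and none of its cells carries an unwanted value."""
    if pos_index <= 8:
        return False
    return not any(c in unwanted for c in cells)
```

Checking every cell against one value set replaces the six per-column conditions with a single rule, at the cost of being slightly less precise about which column a value appears in.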

7. Conclusions and Next Steps

It is far from over. Although the number of rows was reduced from more than 1,250 to about 300, there are still grave issues: spillovers everywhere, the cleanup needed after dealing with them, and the unpivoting of attributes, to name the most obvious.

Anyway, I am having fun with this and really enjoy trying to figure it out.

TBC

Hi Rafal_J,
Could you maybe clarify why it would not be possible to obtain a better, more streamlined dataset? The first thing I would do is (persistently) deal with the supplier of the information, if the information is important.