Transforming a Messy Dataset (PDF source)

@Rafal_J, I wrote that instead of tables I went for pages. I treated Page 1 differently from pages 2-25, but pages 2-25 also have different structures, with 15-18 columns and structure-breaking exceptional columns all over the place. Therefore I assumed that manual corrections might be an option for these exceptions. But I am glad that you found an approach which doesn’t need that. :+1:

Hi @Rafal_J, did the responses provided by the users and experts help you solve your query? If not, how far did you get, and what further help do you need? If yes, kindly mark as solution the answer that solved your query. Thanks!

Hi @Rafal_J, we’ve noticed that no response has been received from you since the 25th of April. We just want to check whether you still need further help with this post. If there is no activity on it in the next few days, we’ll tag this post as Solved.

Hi all,

Here is a summary of my efforts so far (attached). I hope my conclusions will be helpful to other participants dealing with PDF data sources.

MessyDatasetInitialClean.pbix (75.0 KB)

1. Source

I decided to deal with Pages instead of Tables. The rationale is to transform the whole set and generalize the steps instead of picking individual tables. This is a good approach for sources such as invoices, statements and other repetitive documents, but probably not optimal for documents like reports with lots of text and only a few tables.
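For reference, a minimal M sketch of that choice (the file path is a placeholder; Pdf.Tables marks each navigator item as either "Table" or "Page" in its Kind column):

```
let
    // assumption: local file path - replace with your own
    Source = Pdf.Tables(File.Contents("C:\Data\MessyDataset.pdf")),
    // keep whole pages rather than the auto-detected tables
    PagesOnly = Table.SelectRows(Source, each [Kind] = "Page")
in
    PagesOnly
```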

2. GetData step

I filtered the Pages in and observed that the last page (No 26) contains no valuable information, so I wrapped the filter in Table.RemoveLastN to get rid of it. This is useful, but probably dependent on the particular data set.
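Continuing the sketch above, that step looks roughly like this:

```
// page 26 carries nothing useful, so drop the last row of the filtered navigation table
GetData = Table.RemoveLastN(
    Table.SelectRows(Source, each [Kind] = "Page"),
    1
)
```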

3. PositionalIndex step

I added an index column that indexes every page separately. My idea was to analyze the data and try to discover patterns that might ease separating the useful information from the rest (in terms of removing rows). Next I removed the Data column, expanded the TableWithIndex column, and moved the index to the beginning for better readability. It took me three steps - perhaps someone more skilled can compress it into one?
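Roughly, in M, continuing from GetData. The List.Union part is my own assumption, to cope with the 15-18 varying columns across pages; expanding with only the first page's column names could silently drop columns:

```
// add a per-page index inside each nested Data table
TableWithIndex = Table.AddColumn(GetData, "TableWithIndex",
    each Table.AddIndexColumn([Data], "PositionalIndex", 0, 1)),
Removed = Table.RemoveColumns(TableWithIndex, {"Data"}),
// assumption: union the column names of all nested tables first,
// since pages carry 15-18 varying columns
AllCols = List.Union(List.Transform(Removed[TableWithIndex], Table.ColumnNames)),
Expanded = Table.ExpandTableColumn(Removed, "TableWithIndex", AllCols),
// move the index to the front in a single call
Reordered = Table.ReorderColumns(Expanded,
    {"PositionalIndex"} & List.RemoveItems(Table.ColumnNames(Expanded), {"PositionalIndex"}))
```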

4. Document Header Removal

First I thought about how to remove empty rows, but since the index column is never empty, I decided against it for now. Instead I dealt with the first rows: 23 of them contain general information (the source being an invoice), so I got rid of them. This is probably a useful step in every document of this type; I think it might even be a constant pattern throughout every appended document, if there is a series of them.
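Continuing the sketch, this step is a one-liner (assuming the 23 header rows sit at the top of the appended set):

```
// the first 23 rows of the appended set are the invoice's general header
HeaderRemoved = Table.Skip(Reordered, 23)
```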

5. Second Positional Index

I noticed a pattern: rows numbered 0 to 8 within each page are not really needed. To filter them out, I created a conditional column based on the index value, dividing the set into positions less than or equal to 8 and the rest. I also noticed that the last two positions on every page are irrelevant - but since that is a dynamic number, I couldn't figure out how to filter them out based on that pattern.
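A sketch of the conditional column, plus one possible answer to the dynamic last-two-rows problem: compute each page's maximum index with Table.Group and flag anything within two positions of it (column names as in the earlier snippets; Name here is the page name from the navigation table):

```
// flag the per-page header rows (positions 0-8)
WithClass = Table.AddColumn(HeaderRemoved, "RowClass",
    each if [PositionalIndex] <= 8 then "PageHeader" else "Body", type text),
// sketch for the dynamic trailer rows: find each page's maximum index...
MaxPerPage = Table.Group(HeaderRemoved, {"Name"},
    {{"MaxIdx", each List.Max([PositionalIndex]), Int64.Type}}),
Joined = Table.NestedJoin(WithClass, {"Name"}, MaxPerPage, {"Name"}, "Max", JoinKind.LeftOuter),
WithMax = Table.ExpandTableColumn(Joined, "Max", {"MaxIdx"}),
// ...then mark the last two positions on every page
Flagged = Table.AddColumn(WithMax, "IsTrailer",
    each [PositionalIndex] > [MaxIdx] - 2, type logical)
```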

6. Filtering Unwanted Content in Rows

I created an advanced filter: the first condition deals with the rows 0-8 mentioned above; the next six conditions deal with the last two rows on every page (six because there is a set of two unwanted values spread over three columns); and the last two deal with values that look spilled over from other columns. On the plus side, there seem to be no null rows after that. It looks a bit bulky, though - I would love someone to suggest a more elegant way to do it.
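As a possibly more elegant alternative: with the flags from the previous sketch, the whole filter collapses into one Table.SelectRows call that scans every field of a row instead of enumerating columns (the unwanted values below are hypothetical placeholders):

```
// hypothetical placeholders - the real ones are the two values spread over three columns
UnwantedValues = {"ValueA", "ValueB"},
Filtered = Table.SelectRows(Flagged, each
    [RowClass] = "Body"                // drop per-page header rows 0-8
    and not [IsTrailer]                // drop the dynamic last two rows per page
    // scan every field of the row instead of writing one condition per column
    and List.IsEmpty(List.Intersect({UnwantedValues, Record.FieldValues(_)}))
)
```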

7. Conclusions and next steps

It is far from over. Although the number of rows was reduced from more than 1250 to about 300, there are still serious issues: spillovers everywhere, cleaning up after dealing with those, and unpivoting the attributes, to name the most obvious.

Anyway, I'm having fun with this and really enjoy trying to figure it out.

TBC :slight_smile:

Hi Rafal_J,
Could you clarify why it would not be possible to obtain a better-streamlined dataset? The first thing I would do is (persistently) deal with the supplier of the information, if the information is important.

Your answer before:
Thanks for your time guys,
@deltaselect - this is a file stored locally; it imports in the standard PDF way, so the structure in the navigator is N tables + M pages. I guess the answer is no.

Sure,
The supplier of the info is persistently refusing to change anything.

Hi @Rafal_J! We noticed that your inquiry has been left unsolved for quite some time now. We strive to answer ALL inquiries in the forum. However, if you are sure that you have provided all pertinent context for your concerns and have read how to use the forum more effectively, yet still find your question unanswered, you can check out the tutorials to learn this yourself, as your membership also comes with relevant resources that may help with your Power BI education.

While our users and experts do as much as is reasonable to help you with your inquiries, not all concerns can be attended to, especially if there is some learning to be done. Thank you!
