# Transforming a Messy Dataset (PDF source)

Hi Rafal,

I did some puzzling in Excel (I am also not very experienced in Power Query).
It seems you are looking for the rows whose column B contains an order number (??) greater than 3,000,000,000.

A) I added a column in Excel:

1. If the length is 10 and the number is greater than 3 billion, take the number.
2. If the length is greater than 10, but the last 10 digits form a number greater than 3 billion, take that number.
3. Else: not needed.
Excel formula: =IF(AND(LEN(B5)=10,B5*1>3000000000),B5,IF(RIGHT(B5,10)*1>3000000000,RIGHT(B5,10)*1,IF(ISERROR(B5),"Text","not needed")))
For some lines that are not required, this formula still gives a #VALUE! error. (Such error values are not accepted in Power Query and cause errors there.) But at least it gives the order numbers starting with a 3.
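The three rules of that formula can also be sketched outside Excel. Here is an illustrative Python version of the same logic; the threshold and the 10-digit length come from the formula above, while the function name and the `None` return for "not needed" values are my own choices:

```python
def extract_order_number(value):
    """Mirror the Excel formula's three rules:
    1) a 10-digit number greater than 3,000,000,000 is taken as-is;
    2) for longer strings, the last 10 characters are tried as a number;
    3) anything else is "not needed" (None here)."""
    s = str(value).strip()
    if len(s) == 10 and s.isdigit() and int(s) > 3_000_000_000:
        return int(s)
    if len(s) > 10 and s[-10:].isdigit() and int(s[-10:]) > 3_000_000_000:
        return int(s[-10:])
    return None  # the Excel version returns "Text" / "not needed"
```

Unlike the Excel formula, this version never produces an error value for non-numeric input, which sidesteps the #VALUE! issue mentioned above.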

B) Then you can concatenate the text from the other columns in the row with the order numbers, using the formula =CONCATENATE(B2,",",B3,",",B4) (text separated by commas), and load that into Power BI.

Far from ideal; the delivered data output should really be improved at the source. (In my own experience, some external organizations were willing to deliver a query instead of PDF files with headers.)

Check out the April 2021 version of Power BI: there is a new text/CSV-by-example feature that might get you what you need.

Nothing specific, but there is a video on using R within Power BI, and other videos that can help. Having the original PDF file is better; it's just easier working from the PDF file, that's all.

@deltaselect
I thought about using a function that separates numbers from text; that might do the trick for column B, but only partially. I am also considering dividing the whole set of tables into subsets such as "easy transformations" and "edge cases", applying different steps to each, and then appending them together. Yet it wouldn't be a compelling solution.
@cmalone1112
Thanks, I will check this out.
@Keith
I have seen one with an R script used to extract text from a PDF, originating from the biccountant blog. I am not sure whether that is helpful here; the other material seems to show only import techniques.

Sometimes you just have to play around to find the best solution, and sometimes you have to combine components of several videos or processes to get to the final result. It's just a matter of testing different routes.

Hi Rafal,

I double-checked your messy data.
It seems much cleaner after the following steps, done in Excel:

1. Add an index/sequence number in column A so you can later restore the original order of the data; replace the needless null values.
2. Make a unique list of the values in column 3 (copy column 3 to a new sheet and remove duplicates, or make a pivot table from it). It seems the valuable information is where the data in this column starts with PH, WC, AL, or VL; you could double-check this. Give those a 1 in a separate column of the pivot/lookup table, and the non-valuable ones a 0.
3. Add an extra column to your data with VLOOKUP, so you can sort the valuable information (marked with a 1) to the top; this gives around 317 rows. Delete all other lines.
4. Sort on the date in column D, so all non-dates (27 lines) are listed last. Then move the data of those +/- 27 last rows from column E (which has a date) through column N one column to the left, so it ends up in columns D to M.
I did this in Excel, as I did not find an advanced filtering option in the Power BI query editor, and I also do not know how to move part of the row data to a different column in the query editor (probably missing knowledge); in Excel this is easy.
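The step of moving the misaligned rows one column to the left can be done mechanically. Here is an illustrative Python sketch, under the assumption that each row is a list of cells and that columns D through N correspond to 0-based indexes 3 through 13 (the function name and the `None` fill value are my own):

```python
def shift_left(row, start=3, end=13):
    """Move the cells in positions start..end one position to the left,
    overwriting index `start` and leaving None in the vacated last cell.
    With 0-based indexes, start=3 and end=13 correspond to columns D..N."""
    row = list(row)
    row[start:end] = row[start + 1:end + 1]
    row[end] = None
    return row
```

Applied only to the +/- 27 misaligned rows identified by the date sort, this reproduces the manual Excel move described above.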

As your messy data does not have headers, I do not know which data is important to you; maybe you need only a few of the columns.

I hope these simple cleanup steps help you further; obviously you can audit the remaining data properly after the cleanup steps in Excel.
Kind regards,

Hi Jan,

That is the point I am struggling with: information in some rows needs to be moved to different columns, which was also part of my initial post.

Unfortunately that is only partially true. Upon inspection, the data in this column includes: information starting with codes like PH, WC, et al. (which belongs in a separate column), but also, for example, "Kontrakt : XXXXXXX", which should stay in a different column; most of the time the mentioned info is detected in the next column to the right. There is also "0,XXXXX EUR", which is also relevant, and some irrelevant content.
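To make this concrete, the kinds of values mixed into that column could be told apart with simple pattern checks. This is only a rough Python sketch: the prefixes and value shapes come from the description above, while the category names and the exact regular expressions are my own illustration:

```python
import re

def classify_cell(text):
    """Rough classification of the mixed values described above:
    PH/WC/AL/VL codes, 'Kontrakt : ...' labels, EUR rates, or noise."""
    s = text.strip()
    if re.match(r"^(PH|WC|AL|VL)", s):
        return "code"
    if s.startswith("Kontrakt"):
        return "contract"
    if re.search(r"\d+,\d+\s*EUR$", s):
        return "eur_rate"
    return "other"
```

A classification like this could drive a conditional column, routing each value to the column where it belongs instead of filtering on prefixes alone.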

I have attached the desired result data in my initial post.

As to the index: that is certainly a viable point. I can't determine the general sequence of steps yet; my intuition is that stabilizing the proper values in the proper columns is the first thing to do here (and then cleaning out the irrelevant content).

Once again, thanks for the time spent on this, and all the best.

@Rafal_J

Here is a video of transforming your first table… let me know if it helps pls…


Hi David,

Thanks for the vid !

It does the trick on this particular subset of rows; however, I fear that trying to generalize that approach might lose data.

I am starting to believe that the way to go may be creating a number of subsets with separate transformations and then appending the resulting queries.

All the best

Just a thought again: load the actual PDF file and see if someone in the forum has experience using the actual PDF files instead of the messy data in an Excel sheet.

You never know who might be able to use the actual PDF file… it's your call.

Just a quick addition: the PDF source itself.

All the best

Inv 1590274608.PDF (970.2 KB)

Hi @Rafal_J,

I spent a lot of time on the PDF, with less than perfect results. Hopefully it is still better than what you had so far.

Tables: The relevant tables have 8-14 columns. Some tables have an extra column caused by "ść licznika", which you could take out, but you would still have 8-13 columns, and these do not follow a single pattern per column count (i.e., the same number of columns but a different structure). So it would be rather difficult to use the tables. [Actually, it might be rather quick to load all relevant tables individually into Excel and then manually add the missing columns; it's just that there are 72 relevant tables.]

Pages: Pages 2-25 are relatively regular. You can ignore the last page, but the first page has 3 relevant tables, so you have to treat it individually and then append it to pages 2-25. That works, but it adds additional steps, as these two blocks have different structures. My suggested workflow is therefore:

1. Do the bulk of the work from the PDFs automatically
2. Do some corrections manually in Excel for each individual PDF/invoice
3. Combine the corrected Excel tables again with Power Query

I think that is not difficult for you, as you understand the content. E.g., I am pretty sure that some exception causes like "ść licznika" are not relevant, which means that not all lines that are off need to be corrected; they can probably also simply be filtered out.

Check out the result and see if a manual correction is OK:
Inv 1590274608.xlsx (85.1 KB)

Hello @Matthias,

Thank you for spending some time on this.

What I have learnt so far is that the preferred way seems to be working with Pages instead of Tables; the reason is structural, as there seems to be a bit less messy content.

I agree that it might be easier to decompose it into an "easy" and a "hard" part.

I even mentioned that earlier.

I cannot figure out the appropriate sequence of steps, i.e., whether to deal with the offset first or with the filtering. There is a post which I find quite inspiring:

It seems that the kind of problem is similar, yet the logic of the solution provided by @Melissa is pretty advanced. I am the first to admit that it looks elegant and efficient; I will try to use it on this one, but I need to study it further.

Your file is on (mostly) and off (sometimes); the guidance as to what's relevant, and the desired structure, is attached in my first post.

Once again thanks for your input

@Rafal_J, I wrote that instead of tables I went for pages. I treated page 1 differently than pages 2-25, but pages 2-25 also have different structures, with 15-18 columns including structure-breaking exceptional columns all over the place. Therefore I assumed that manual corrections might be an option for these exceptions. But I am glad that you found an approach which doesn't need that.

Hi @Rafal_J, did the responses provided by the users and experts help you solve your query? If not, how far did you get, and what kind of help do you need further? If yes, kindly mark as solution the answer that solved your query. Thanks!

Hi @Rafal_J, we’ve noticed that no response has been received from you since the 25th of April. We just want to check if you still need further help with this post? In case there won’t be any activity on it in the next few days, we’ll be tagging this post as Solved.

Hi all,

Here is the summary of my efforts so far (in attachment). I hope that my conclusions will be helpful to some participants dealing with PDF data sources.

MessyDatasetInitialClean.pbix (75.0 KB)

1. Source

I decided to deal with Pages instead of Tables. The rationale is transforming the whole set and generalizing the steps, instead of picking individual tables. This is a good approach for sources such as invoices, statements, and other repetitive documents; it is probably not optimal for documents like reports, with lots of text and only a few tables.

2. GetData step

I filtered the Pages in and observed that the last page (No. 26) contains no valuable information, so I used Table.RemoveLastN around the filter to get rid of it. This is useful, but probably dependent on the particular set.

3. PositionalIndex step

I added an index column which indexes every page separately; my idea was to analyze the data and try to discover patterns that might ease separating useful info from the rest (in terms of getting rid of rows). Next, I removed the Data column, expanded the TableWithIndex, and moved the index to the beginning for better readability. This took me three steps; probably someone more skilled can compress it into one?
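The per-page index described here (a counter that restarts on every page) can be sketched compactly. This is an illustrative Python version of the idea, assuming each row carries its page value as the first element; the function name and row shape are my own:

```python
from itertools import groupby

def add_positional_index(rows, page_of=lambda r: r[0]):
    """Number the rows 0, 1, 2, ... separately within each page,
    restarting the counter whenever the page value changes
    (rows are assumed to be grouped by page already)."""
    indexed = []
    for _, group in groupby(rows, key=page_of):
        for i, row in enumerate(group):
            indexed.append((i, row))
    return indexed
```

The resulting positional index is exactly what makes the later "rows 0 to 8 within each page" filter possible.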

4. Removing the First Rows

First I thought about how to remove empty rows, but having an index (which is obviously not empty), I decided against it at this point. Instead I dealt with the first rows: it seemed that 23 of them contain some general information, the source being an invoice, so I got rid of them. This is probably a useful step in every document of this type; I think it might even be a constant pattern throughout every appended document, if it is a series of them.

5. Second Positional Index

I noticed a pattern: rows numbered 0 to 8 within each page are not really needed. To filter them out, I created a conditional column, based on the index column value, dividing the set into positions less than or equal to 8 and the others. I also noticed that the last two positions on every page are irrelevant; yet since this is a dynamic number, I couldn't figure out how to filter them out based on that pattern.

6. Filtering Unwanted Content in Rows

I created an advanced filter: the first condition deals with the 0-8 rows mentioned above; the next six conditions deal with the last two rows on every page (six because there is a set of two unwanted values spread over three columns); the last two deal with values which look to have spilt over from other columns. On the plus side, it looks like there are no null rows after that. It is a bit bulky, though; I would love someone to suggest a more elegant way to do it.
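One way to make such a filter less bulky is to collapse it into a single predicate: a positional cutoff plus a blacklist of unwanted values checked against every cell. This is an illustrative Python sketch of the idea only; the placeholder values in `UNWANTED` are invented, not the actual ones from the dataset:

```python
# Placeholder blacklist; the real unwanted values would go here.
UNWANTED = {"Suma", "Strona"}

def keep_row(pos_index, cells, unwanted=UNWANTED):
    """Keep a row only if its within-page position is past the header
    block (rows 0-8) and none of its cells carries an unwanted value."""
    if pos_index <= 8:
        return False
    return not any(c in unwanted for c in cells)
```

Checking every cell against one value set replaces the six per-column conditions with a single rule, at the cost of being slightly less precise about which column a value appears in.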

7. Conclusions and Next Steps

It is far from over. Although the number of rows was reduced from more than 1,250 to about 300, there are still grave issues: spillovers everywhere, the cleanup needed after dealing with them, and the unpivoting of attributes, to name the most obvious.

Anyway, I am having fun with this and really enjoy trying to figure it out.

TBC

Hi Rafal_J,
Could you maybe clarify why it would not be possible to obtain a better, more streamlined dataset? The first thing I would do is (persistently) deal with the supplier of the information, if the information is important.