PAD - AWS Invoice PDF Scan - how to remove extra characters

dc1128 · September 8, 2023, 6:11pm

Hi All,
I have a PA Desktop flow that reads an AWS invoice PDF and extracts a few key fields (Account Number, Invoice Number, Original Invoice Number, PO Number, Total Amount, Invoice Date, Billing Period). It outputs the scanned result to a CSV, which feeds a Power BI report. For the most part it’s all good, but I struggle with anything involving dates, such as Invoice Date and Billing Period: there are always one or more additional characters picked up from the next line. I tried regular expressions and Trim, but I just can’t get it perfectly clean in PAD. For now, I use Power BI Transform to replace the entries manually, or I modify the M query, but that doesn’t resolve the underlying issue. I’m wondering whether the only way to clean this perfectly is to use scripts. Initially I was using cloud flows, including AI Builder (could never get past 49%). Please see the attached PDF for the screenshots; all numbers have been masked out but keep their actual character length. Any suggestions are greatly appreciated!

PAD_PDF-Invoice-Flow-screenshots.pdf (343.9 KB)

dc1128 · September 15, 2023, 9:37pm

I fixed the date issue using “Text to Columns”.

Adrian1 · June 23, 2024, 11:38pm

This is probably already solved, but for future reference, I suggest looking into RegEx, which PAD can use natively (unlike PA cloud). Parsing your text with RegEx patterns is the easiest route to retrieving what you need from a structured data source (like a PDF in this case).
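
To illustrate the RegEx idea, here is a minimal sketch in Python (not PAD itself). The sample text, the field labels, and the date format are all assumptions made for illustration, since the real AWS invoice layout isn't shown in this thread. The point is to anchor on the label and capture only the date, so that characters from the next line are never part of the match.

```python
import re

# Hypothetical text as it might come out of a PDF extraction step.
# The labels and layout are assumptions, not the actual AWS invoice format.
SAMPLE_TEXT = """Invoice Date: September 3, 2023
TOTAL AMOUNT DUE
Billing Period: August 1 - August 31, 2023
Account number: 000000000000
"""

MONTH = (
    r"(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)"
)

# Match the label, then capture exactly "Month D, YYYY" and nothing after it.
INVOICE_DATE = re.compile(rf"Invoice Date:\s*({MONTH} \d{{1,2}}, \d{{4}})")

# Capture the start ("Month D") and end ("Month D, YYYY") of the period.
BILLING_PERIOD = re.compile(
    rf"Billing Period:\s*({MONTH} \d{{1,2}})\s*-\s*({MONTH} \d{{1,2}}, \d{{4}})"
)


def extract_dates(text):
    """Return the invoice date and billing period, or None where not found."""
    invoice = INVOICE_DATE.search(text)
    period = BILLING_PERIOD.search(text)
    return {
        "invoice_date": invoice.group(1) if invoice else None,
        "billing_period": (
            f"{period.group(1)} - {period.group(2)}" if period else None
        ),
    }


if __name__ == "__main__":
    print(extract_dates(SAMPLE_TEXT))
```

Because each pattern ends at the four-digit year, any stray characters that follow the date are simply left out of the captured group. If the invoice uses a different date format, the patterns would need adjusting; the same expressions should carry over to PAD's "Parse text" action with the regular-expression option turned on, though I haven't verified them there.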