Import from PDF where the PDF layout changes

Hi

I know how to import tables from a PDF.

My issue is that some of the pdf files I want to import from may or may not have say Sales data Table and if they do it may not have the same table number or be on a the same page as in the other files.

Can one have Power Query search for a table with a particular heading irrespective as to where it is in the pdf and then import it.

In the attached files I have imported the Sales Table from Report1.pdf but in the second query I manually had to determine which table it was on this occurrence of the Report2.pdf. (It may be that next time Report2 may have a new table inserted above the first table in the report and so the table name will change).

Once I get both tables I then want to merge them.

How can I get PQ to find and import the table I want from each file in a dynamic manner?

Kind Regards

Allister

Report1.pdf (16.6 KB)
Report2.pdf (19.0 KB)
Summary.xlsx (22.2 KB)

Hi @AllisterB,

In the new learning portal use the Search option (top right) and enter my name: Melissa
Look for this course: Applied Problem Solving with Power Query/M
and review this chapter: Convert a PDF to a Tabular Format

I’m confident this will help you work through your scenario.

All the best.

3 Likes

Hi Melissa

Thank you for your recommendation.

Are the files used in the course available for me to learn with? If so - where are they?

Kind Regards

Allister

There should be a resource pack in the course, please check that - sorry but it is a bit difficult, on mobile…

Hi Melissa

I cannot see the resource pack on the front screen of the course.

I have been working through the course to solve my initial posting. However, I have a page that has a table that only has one column so I get an error when I run the code that moves Column 2 (which doesn’t exist) to Column 1. I have tried using try to solve this.

How would teh cose iin your example change if one of the tables had only one column and it was in Column 1

Regards

Allister

let
    Source = Pdf.Tables(File.Contents(FileLocation), [Implementation="1.3"]),
    #"Filtered Rows" = Table.RemoveLastN(Table.SelectRows(Source, each ([Kind] = "Page")), 1 ),
    #"Added Custom" = Table.AddColumn(#"Filtered Rows", "ColCount", each Table.ColumnCount([Data])),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Data2", each let
            RenameCols = Table.RenameColumns([Data],{{"Column1", "Temp1"}, {"Column2", "Temp2"}}),
            UpdateCol1 = try Table.AddColumn(RenameCols, "Column1", each if [Temp1] = null then [Temp2] else [Temp1]),
            UpdateCol2 = Table.AddColumn(UpdateCol1, "Column2", each if [Temp2] = [Column1] then null else [Temp2]),
            DelCols = Table.RemoveColumns(UpdateCol2,{"Temp1", "Temp2"})
        in
            DelCols)
in
    #"Added Custom1"

HI @AllisterB,

I will inquire with the eDNA team for the course resource pack. Thanks for reporting back :+1:

As for tables with 2 and sometimes 1 column, there is a set of table functions in M that allow for a MissingField.Type parameter, select one of those. These are the options:

Name Value Description
MissingField.Error 0 Indicates that missing fields should result in an error. (This is the default value.)
MissingField.Ignore 1 Indicates that missing fields should be ignored.
MissingField.UseNull 2 Indicates that missing fields should be included as null values.

.
If that isn’t possible, are you able to generate a list with column names that you should get (at most)?

1 Like

Hi @AllisterB you can now download the resources of the Convert a PDF to a Tabular Format video mentioned by @Melissa, along with all other resources of the course (if needed).

Thanks!

1 Like

This course was so very helpful. once you work it through a couple of times you get to feel as though you are learning techniques that will be useful in solving other issues - thank you Melissa

1 Like

Hi

When I l click on the Download button I get the attached.
Download result.docx (126.9 KB)

Appears to be an issue wi ith Firefox as it works with Chrome.