Combining parquet files in a Sharepoint

Roberto · April 5, 2022, 2:16pm

Hi @sam.mckay ,
I’ve about 500 parquet files in a SharePoint I need to retrieve.
I’m used to handling CSV files using functions like

= Table.AddColumn(#"-- staging CSV content", "Custom", each Table.PromoteHeaders(Csv.Document([Content])))

With parquet I should use the Parquet.Document function, but when I apply it to the Content column of the table I get: Parameter.Error: Parquet.Document cannot be used with streamed binary values.

It seems Parquet.Document requires a real file and not a binary stream.

How can I overcome this?

Unfortunately I cannot provide a model, since I cannot remove references to company information.

Thanks for your help

Roberto

rajender1984 · April 5, 2022, 3:30pm

Hi @Roberto ,

Since we don't have a dummy data set, could you please check whether this video helps: https://www.youtube.com/watch?v=5hCznl9tOsk

Also, please try the formula below:

= Table.AddColumn(#"Removed Other Columns", "Custom", each Parquet.Document([Content], [Compression=null, LegacyColumnNameEncoding=false, MaxDepth=null]))

Roberto · April 5, 2022, 5:08pm

Hi @rajender1984,
thanks for your reply.
I had seen the video, but he uses the combine and I need to implement incremental refresh, as I already did with CSV. Overall the dataset is more than 2 billion rows, and I update it daily by adding the new data for the day.
I cannot replicate the SharePoint, but I got a file from Kaggle and split it into two parquet files.
I've also created a simple pbix, but it uses a folder as the source, and in that case Parquet.Document works just fine.
When I replicate the same steps against a SharePoint, the issue pops up again.

Thanks for help

Roberto

collecting parquet files in sharepoint.pbix (47.7 KB)

penguins_lter1.parquet (9.9 KB)
penguins_lter2.parquet (9.9 KB)

Roberto · April 6, 2022, 8:20am

Hi @rajender1984,
after sleeping on it and some googling, I found this article on [Chris Webb's blog] with the solution: the content has to be wrapped in Binary.Buffer before it is passed to Parquet.Document.

Parquet.Document(Binary.Buffer([Content]))

The caveat is that Binary.Buffer loads the whole file into memory, so with big parquet files the ETL can fail with memory issues.
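For anyone landing here later, this is a minimal sketch of how the whole query could look. The site URL, the step names and the "Data" column name are placeholders I made up for illustration, so adjust them to your own setup. It is only a sketch: I have not included the incremental refresh filtering.

```m
let
    // Assumption: hypothetical site URL, replace with your own SharePoint site
    Source = SharePoint.Files("https://contoso.sharepoint.com/sites/data", [ApiVersion = 15]),
    // Keep only the parquet files
    ParquetFiles = Table.SelectRows(Source, each Text.Lower([Extension]) = ".parquet"),
    // Buffer each binary in memory first, so Parquet.Document does not receive a streamed value
    AddData = Table.AddColumn(ParquetFiles, "Data", each Parquet.Document(Binary.Buffer([Content]))),
    // Keep the file name and the parsed table, then append all tables into one
    Kept = Table.SelectColumns(AddData, {"Name", "Data"}),
    Combined = Table.Combine(Kept[Data])
in
    Combined
```

Since every file is buffered in full before it is parsed, it probably helps to filter the file list down as much as possible before the Table.AddColumn step.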

Thanks

Roberto

(https://blog.crossjoin.co.uk/2021/03/07/parquet-files-in-power-bi-power-query-and-the-streamed-binary-values-error/)