Combining Many Large Files and Grouping Rows

Hi EDNA Community

I have 75 .csv files of around 35 MB each that I am trying to combine into a single dataset.

I am using the Folder/Combine Files function - this seems to work.

At the moment the data is all in one column and needs to be grouped and extracted.

The raw data set is 75m rows - 1m rows per file. I then group every 7 rows together and delimit them by position to create a record in one row. Eventually I should end up with just under 11m rows.
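
In simplified terms, the kind of reshape I am after looks like this (just a rough sketch with a placeholder path and field names, not the exact query in the attached file):

```
let
    // placeholder path - in the real model this is the output of Folder / Combine Files
    Source = Csv.Document(File.Contents("C:\Data\Sample.csv"), [Delimiter = ",", Columns = 1]),
    // number every raw row 0, 1, 2, ...
    Indexed = Table.AddIndexColumn(Source, "Index", 0, 1),
    // rows 0-6 become record 0, rows 7-13 become record 1, and so on
    RecordID = Table.AddColumn(Indexed, "RecordID", each Number.IntegerDivide([Index], 7), Int64.Type),
    // position 0-6 of the row within its record, as text so it can become column names
    Position = Table.AddColumn(RecordID, "Position", each Text.From(Number.Mod([Index], 7)), type text),
    Removed = Table.RemoveColumns(Position, {"Index"}),
    // one row per RecordID with the 7 values spread across columns "0" to "6"
    Pivoted = Table.Pivot(Removed, {"0", "1", "2", "3", "4", "5", "6"}, "Position", "Column1"),
    Renamed = Table.RenameColumns(Pivoted, {{"0", "Field1"}, {"1", "Field2"}, {"2", "Field3"}, {"3", "Field4"}, {"4", "Field5"}, {"5", "Field6"}, {"6", "Field7"}})
in
    Renamed
```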

I have moved this to a virtual machine as my laptop is underpowered and kept freezing - this only happens when I add the detailed Power Query though. So it seems that the Combine Files per folder works, but the Power Query that converts the single column into a record with multiple columns per row does not once the data gets big.

The attached file is what I want and also shows that the Power Query works on smaller datasets.

I am unable to request the data in a better format.

I also don’t have the SQL skills to manipulate the source data which, before splitting it up into 75 files, is 2.3GB. I know that the upload file limit is 1GB.

Looking forward to hearing from anyone who can help with this.

BSAS Model A12 Sample.pbix (258.4 KB)

@WarrenWilbraham So you already have a database, and when you extract the data into CSV, the data is in one single column?

Hi Antriksh

The data has been provided by a client in .csv format so I don’t have direct access to it. I am using the .csv files to create a large dataset in Power BI for analysis.

Thanks
W

@WarrenWilbraham Ok, but when you are done with the project and hand over the solution, the data source will be SQL, right?

@AntrikshSharma - no, the data source needs to be the .csv files that I have. If I can get them into another format to read into the Power BI model, that also works. I am only handing the model back to the client.

@WarrenWilbraham Ok, the problem with the code is that you are adding an Index column multiple times, and the first one is added across all 75m rows. All of this has to be done in memory/RAM, so whatever you do it is going to be slow and will take forever to complete.
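
If you want to avoid index columns altogether, one index-free pattern you could test is cutting the single column into blocks of 7 with List.Split and rebuilding the table from those rows - a rough sketch only (placeholder path and field names, and it assumes the row count is an exact multiple of 7):

```
let
    // placeholder path - point this at the combined single-column output
    Source = Csv.Document(File.Contents("C:\Data\Sample.csv"), [Delimiter = ",", Columns = 1]),
    // pull the single column out as a list
    Values = Source[Column1],
    // cut the list into consecutive blocks of 7 values
    Blocks = List.Split(Values, 7),
    // each block of 7 becomes one row with 7 columns
    Result = Table.FromRows(Blocks, {"Field1", "Field2", "Field3", "Field4", "Field5", "Field6", "Field7"})
in
    Result
```

No idea how it behaves at 75m rows, but it at least skips the repeated indexing.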

Also, I don’t know if you can trust the grouping based on 7 rows - what if the data in one of the files is inconsistent?

That means even after you are done they are still going to rely on CSV even when they have a data warehouse? That seems silly.

You should see if Datamarts can be leveraged here.

Thanks @AntrikshSharma. I understand the RAM/indexing problem, and it is consistent with my experience of the conversion. Is there any other way to group the rows? There is normally more than one way to solve a problem in Power Query and DAX. They are not going to use the .csv source files; they only need the model. I am happy that the 7-row grouping is consistent. How would the Datamart work?

@WarrenWilbraham https://learn.microsoft.com/en-us/power-bi/transform-model/datamarts/datamarts-overview

It is like loading data using PQ into an instance of SQL Server automatically managed by Microsoft. It will act as a database and will allow query folding.
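
Once the data is loaded into the datamart you would point the model at its SQL endpoint, roughly like this (the server, database, schema, and table names below are placeholders - copy the real connection details from the datamart settings):

```
let
    // placeholder server and database names - use the datamart's actual SQL connection string
    Source = Sql.Database("your-datamart-sql-endpoint", "YourDatamart"),
    // placeholder schema and table names
    Staging = Source{[Schema = "model", Item = "Staging"]}[Data],
    // filters like this fold back to the SQL engine instead of running in Power Query's memory
    Filtered = Table.SelectRows(Staging, each [Field1] <> null)
in
    Filtered
```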

I refined the query but still no luck - I will try this, @AntrikshSharma - thanks.

Hello @WarrenWilbraham

Good to see that you are making progress with your inquiry. Did the response from @AntrikshSharma help you solve it?

If it did, kindly mark his answer as the Solution to your query.

If not, how far did you get and what further help do you need?

Thanks @AntrikshSharma. The proposal worked, but I still need to figure out how to combine multiple files, as I am limited to a 1GB file upload and I have files of 2.3GB and 1.7GB respectively - but I like where the datamart can lead.