Replace blanks in dataset - best practice?

JulieCarignan · December 24, 2024, 4:27pm

When doing ETL using Power Query, is it best practice to replace all nulls and blanks/empty strings with a word such as “Unspecified”? I assume this would be done in a cleaning query that is referenced by both the downstream fact table and dimension tables.

Or should only blanks be replaced?
And is it better that they be replaced with nulls or with a word such as “Unspecified”?

In my dataflows, replacing all blanks and nulls results in thousands of “Unspecified” values repeated throughout the dataset, most of which eventually get removed anyway when the dimension tables get de-duplicated before adding the index column. Is that no problem due to the efficiency of the Vertipaq engine?

Note about the blanks in my data:
One of my primary data sources contains help desk tickets that I get from a data lake (SQL database to which I don’t have write access). Some ticket fields aren’t populated until later in the ticket’s lifecycle, like its Status and which group and technician it’ll be Assigned to. Other fields are sometimes never populated such as the Category, Configuration or other fields that are irrelevant to certain types of tickets or issues.

So it’s not a matter of getting data from a spreadsheet that has nulls or blanks because of merged cells or values that could be filled down.

Do you know of any downsides or cautions related to following that practice?

I hoped to find this answer in the Definitive Guide to Power Query but couldn’t find an answer there.

Julie

Melissa · December 25, 2024, 2:53pm

@JulieCarignan

A null simply means the absence of a value, and it’s not the same as a “blank” (an empty text string). Both nulls and blanks can hold meaning in your data - they’re not “errors”, just indicators.

Whether they should be treated as equivalent depends on the context. That’s a decision we, as humans, need to make. When deemed equal, one value needs to be replaced with the other for consistency.

That said, nulls can sometimes cause issues depending on your workflow. For example:

Nulls are removed during an Unpivot.
Nulls will be propagated in calculations.

In DAX, it’s also good to note that blanks and empty strings (“”) are not always treated the same. Here’s a fun quiz to test your knowledge.

Merry Christmas
I hope this is helpful.