Retain ability to join after reshaping dimension tables

Hi All,
I’ve got a dataset that has five columns of a dimension table that need substantial reshaping. Since this will be a large dataset, to minimize the data that has to be manipulated during reshaping steps (indexing, grouping, pivoting etc.), I removed all the other columns and reshaped the 5 columns as needed, resulting in about 20 sparsely populated pivoted columns.

My next step is to transform the dataset into a star schema.

Now, when I go to merge the reshaped dimension table with the fact table the way I normally would, none of the columns that would normally serve as the compound key for the join exist anymore, since they’ve been pivoted and now have different names.

I’m at a loss how to proceed from here. I haven’t seen this addressed in trainings or videos.

Attached generalized file: The last 2 groups of queries are the ones I’m asking about.


License Charge Analysis Generalized.xlsx (164.7 KB)
License Charge Analysis Generalized_Simplified_backup.pbix (97.0 KB)

@JulieCarignan,

I’m a bit unclear about both what you’re trying to do and why. Not that it’s inappropriate, but when I can’t find anything written on what I’m trying to do, it often means I need to rethink my plan 🙂

From what I understand, you started with a single dimension table and needed to restructure the 5 columns you retain. That resulted in 20 separate, pivoted dimension tables that no longer contain the original keys. If that’s the case, you’ll need to recreate those keys or generate new surrogate keys. This might involve concatenating the columns that originally formed the key or generating a new unique identifier that will link the pivoted tables back to your fact table.

Are there any common columns among your pivoted tables that could be used to help link them back to your fact table? Would it be possible to combine some or all of the pivoted tables into a single dimension table after reshaping? This would help retain a unified key structure and simplify the relationships with your fact table.

Alternatively, could you link the pivoted tables back to your fact table in a chain of relationships?

Lastly, is it absolutely necessary to break all the data into 20 separate tables? Depending on your needs, it might be simpler to work with fewer pivoted tables or even retain some of the original structure in your dimension table.


@HufferD

You’ve given me food for thought, I really appreciate that.

Your first paragraph made me chuckle; yes, I do think I need to rethink my plan.

Maybe it would help if I made it clear that the columns I showed in the image were only a subset of the columns provided by the database view that’s my data source; there were also a handful of straightforward numeric columns that will end up in the Fact table. I only showed the columns that have the odd data structure I’m struggling to transform.

How to somehow retain the key that the fact table can recognize as a match when joining is the issue I’ve got to figure out.

Since I wrote my original post, the good news is that I’ve convinced the database team who owns this data to include the primary key field in the views that are my data sources, so that will help. They also provided a couple of extra views that avert the need for some of the most numerous of the required transformations (they’re already unpivoted). But that introduces a new challenge: three tables, each with a different data structure, must now somehow be merged, in addition to figuring out how to pivot the remaining records.

Is it clear that the issue is that each value in each of the 5 Details columns has to be split by delimiter, with the prefix (the part before the delimiter) becoming the column header for the post-delimiter part of the value?

I believe it IS necessary to do that reshaping and separating into separate columns, because it’s the prefixes my management wants to slice by (Name, Agency, Licenses, Role, Circuit, Project, etc.) as they seek ways to reduce costs.
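To make sure we’re picturing the same operation: here is a minimal sketch in plain Python (not Power Query M) of that split-by-delimiter reshaping, where the prefix of each Details value becomes the column header for the remainder. The column names, delimiter, and sample values are all made up for illustration.

```python
# Hypothetical rows: each Details value is "Prefix: rest", and the prefix
# should become the column name in the reshaped output.
rows = [
    {"Key": 1, "Details1": "Agency: Treasury", "Details2": "Role: Admin"},
    {"Key": 2, "Details1": "Project: Apollo",  "Details2": "Agency: Labor"},
]

reshaped = []
for row in rows:
    out = {"Key": row["Key"]}          # carry the join key through every step
    for col in ("Details1", "Details2"):
        value = row.get(col)
        if value:
            prefix, _, rest = value.partition(": ")
            out[prefix] = rest          # prefix becomes the new column header
    reshaped.append(out)

# reshaped now holds sparsely populated, pivoted rows keyed by "Key".
```

The important part is the first line inside the loop: the key is copied into the output before any splitting happens, so the pivoted rows stay joinable.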

I’ve been assuming that I should do all of the performance-heavy reshaping in the dimension table or a helper query, so that the fact data doesn’t also have to be shuffled through all the transformations.

But perhaps the transformations MUST be done in a parent query that feeds both the future dim and fact tables so that the linking field isn’t lost.

I’ve actually accomplished this with a long series of repetitive conditional statements, but having seen @Melissa demo amazing transformations with helper queries turned into subqueries and custom functions, I hoped to find a solution like that here, since it would likely be more robust.

Well, thank you for your ideas, @HufferD .

I wish I could ask an AI model how it would solve this, but that’s prohibited in my organization.

Julie

Below is a strategy you might consider to address your challenges:


1. Preserve the Original Key Early On

Why it Matters:
When you pivot or unpivot data, the original compound key (or its components) can disappear if not explicitly carried along. Since your fact table will join to the dimension table, you need an immutable, consistent key.

How to Do It:

  • Include a Surrogate or Primary Key: Work with your database team to ensure that a primary key (or a surrogate key) is present in your source views. This key should be carried through all transformation steps.
  • Retain the Key in Helper Queries: In your initial staging or helper queries, ensure that the key is not transformed or removed. Instead, join it back after pivoting/unpivoting.

2. Use a Staging or Parent Query for Consistent Transformation

Why it Matters:
By centralizing the transformation logic, you reduce the chance of losing key fields and avoid redundant data reshaping on the fact side.

How to Do It:

  • Common Table Expressions (CTEs) or Subqueries: Create a parent query (or a CTE) that does the heavy lifting of splitting the 5 Details columns by delimiter. In this query, generate two outputs:
    • One output for the dimension table, including the reshaped (pivoted) columns plus the preserved key.
    • A second output that can feed into the fact table, carrying forward the join key as well as the numeric measures.
  • Single Transformation Pipeline: If you pivot first, then join back the key field via a subquery, you can guarantee that every row in the dimension table has a matching key to join on.
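The parent-pipeline idea above can be sketched in plain Python (the same logic translates to a staging query in SQL or Power Query M; all field names here are hypothetical):

```python
# One source feeds both outputs, so the join key is never lost.
source = [
    {"Key": 10, "Details": "Agency: Treasury", "Charge": 120.0},
    {"Key": 11, "Details": "Agency: Labor",    "Charge": 75.5},
]

# Output 1: dimension rows (key + reshaped attributes).
dim = []
for row in source:
    prefix, _, rest = row["Details"].partition(": ")
    dim.append({"Key": row["Key"], prefix: rest})

# Output 2: fact rows (key + numeric measures only).
fact = [{"Key": r["Key"], "Charge": r["Charge"]} for r in source]

# Every fact row can still be joined back to its dimension row on "Key".
lookup = {d["Key"]: d for d in dim}
```

Because both branches start from the same staged rows, the key column is guaranteed to match on both sides of the eventual relationship.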

3. Replace Repetitive Conditional Statements with Dynamic or Modular Logic

Why it Matters:
Long lists of repetitive conditionals are hard to maintain and can be inefficient, especially if the transformation logic changes or grows.

How to Do It:

  • Mapping Tables or UDFs: Create a mapping that associates each detail prefix (e.g., “Name”, “Agency”, “Licenses”, etc.) with a target column name. Then, use a user-defined function (UDF) or a dynamic SQL approach to iterate over these mappings.
  • Dynamic Pivoting: If your SQL dialect supports it, you can use dynamic pivoting to generate the column names on the fly while ensuring that the join key remains attached. This reduces hard-coding and simplifies maintenance.
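A tiny Python sketch of the mapping-table idea, in place of a chain of conditionals (the prefixes and target column names are invented for illustration):

```python
# One mapping row per prefix replaces one conditional branch per prefix.
prefix_to_column = {
    "Name": "EmployeeName",
    "Agency": "AgencyName",
    "Licenses": "LicenseType",
}

def route(value):
    """Split a 'Prefix: rest' value and return (target_column, rest)."""
    prefix, _, rest = value.partition(": ")
    # A dictionary lookup scales with the mapping table, not the code;
    # an unknown prefix fails fast instead of silently falling through.
    return prefix_to_column[prefix], rest
```

Adding a new prefix then means adding one row to the mapping, not another branch of conditional logic.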

4. Merging Multiple Data Structures

Why it Matters:
You now have three different data structures (views) that need to be merged. Each one might have its own format for the details.

How to Do It:

  • Standardize in a Staging Area: Before joining, create a staging query that standardizes the column names and ensures the join key is present in each view.
  • Union or Join Operations: Depending on whether the extra views are complementary or overlapping, use UNION (to combine rows) or JOIN (to merge columns) based on the preserved key field.
  • Consistent Transformation: Apply your pivot/unpivot logic in this staging area so that once the data is merged, you have a uniform structure with the key intact.
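As a rough illustration of the staging step, here is a plain-Python sketch of normalizing two differently shaped views to one structure before the union (the view shapes and column names are hypothetical):

```python
view_a = [{"PK": 1, "agency": "Treasury"}]          # one source shape
view_b = [{"RecordId": 2, "AgencyName": "Labor"}]   # a different shape

def standardize_a(row):
    return {"Key": row["PK"], "Agency": row["agency"]}

def standardize_b(row):
    return {"Key": row["RecordId"], "Agency": row["AgencyName"]}

# Once every view shares one structure and carries the key,
# the union (or join) is trivial.
staged = [standardize_a(r) for r in view_a] + [standardize_b(r) for r in view_b]
```

One small standardizing function per view keeps the per-source quirks isolated from the shared pipeline.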

5. Performance Considerations

Why it Matters:
Transformations (especially pivoting and string splitting) can be costly on large datasets.

How to Do It:

  • Minimize Data Shuffling: As you suspected, try to confine heavy transformations to a smaller helper or staging table that only includes necessary columns (key + transformed details).
  • Do Not Duplicate Work on the Fact Table: Perform all the reshaping in a single parent query, and then join with the fact table using the preserved key. This avoids duplicative processing of the large fact table.
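The "minimize data shuffling" point can be reduced to a one-line projection, sketched here in plain Python with invented column names:

```python
# A wide source row: key, the column needing heavy reshaping, and measures.
wide_rows = [{"Key": 1, "Details": "Agency: Treasury", "Charge": 120.0, "Qty": 3}]

# Project ONLY the key and the column to be transformed into the helper
# table, so the measures are never dragged through the reshaping steps.
helper = [{"Key": r["Key"], "Details": r["Details"]} for r in wide_rows]
```

The measures stay in the wide table and are joined back on the key only once, after the reshaping is done.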

Summary

Your goal is to ensure that, even after splitting and pivoting the details columns, the join key remains available for merging with the fact table. You can achieve this by:

  • Preserving a primary/surrogate key from the start,
  • Centralizing your transformations in a parent query or CTE that feeds both the dimension and fact tables,
  • Replacing lengthy conditional logic with modular dynamic solutions (like UDFs or mapping tables),
  • Standardizing and merging multiple views in a staging area to maintain consistency.

This strategy not only ensures that your join key isn’t lost during transformation but also enhances maintainability and performance of your ETL process.

If you need more specific code examples or further discussion on any of these points, let me know!

This is helpful.
I’m working in Power Query rather than SQL, but I can adapt the concepts.

I’ve made a great deal of progress on the transformation since I wrote this post.

A key discovery that enabled a breakthrough was that the Table.Combine function can combine tables that have completely different data structures.

That function lets me split the dataset into multiple temporary tables that each need different transformations, reducing the amount of data being manipulated. Table.Combine can then re-combine them even though each table ends up with entirely different columns.
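For anyone unfamiliar with that behavior: Table.Combine unions tables even when their columns differ, filling the missing columns with nulls. Here is a rough plain-Python imitation of that behavior (the table contents are invented for illustration):

```python
def combine(*tables):
    """Union lists of row-dicts, padding missing columns with None,
    roughly like Power Query's Table.Combine."""
    all_cols = []
    for table in tables:
        for row in table:
            for col in row:
                if col not in all_cols:
                    all_cols.append(col)
    return [{col: row.get(col) for col in all_cols}
            for table in tables for row in table]

t1 = [{"Key": 1, "Agency": "Treasury"}]   # one structure
t2 = [{"Key": 2, "Role": "Admin"}]        # a completely different structure

combined = combine(t1, t2)
```

Each output row has the full set of columns, with nulls where a source table had no such column, which is why the differently shaped temporary tables can be re-combined safely.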

Another help was asking the database owner to add the primary key to each of the views that were my data sources, eliminating the complexity of using compound keys to join the various tables. That seems obvious in retrospect, but initially I assumed it wouldn’t be possible.

I hope this helps someone who might struggle with a similar situation.

Julie
