How to determine partition column for large table extraction from source systems during Azure-based data migration?

Janice Chi 100 Reputation points
2025-06-10T07:48:15.46+00:00

In our data migration project (source: IBM DB2 → Azure), we are designing a control-table-based ingestion framework to handle large source tables (10–1500 GB range). For performance and checkpointing, each table is split into partitions, and data is extracted partition-wise.

Our challenge is identifying the right partition column for such large source tables when:

The natural partitioning column (e.g., last_updated_ts) is missing or unreliable

Table size is large and needs to be split across multiple extract runs

We want to keep the extraction scalable, checkpoint-safe, and resumable

Can Microsoft recommend best practices or guidelines for:

1. Identifying ideal partitioning columns when metadata is limited

2. Designing partition logic for high-volume extract pipelines (using ADF or Spark)

3. Handling cases where no datetime or numeric column exists for natural slicing

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Alex Burlachenko 8,315 Reputation points
    2025-06-10T08:22:08.54+00:00

    Hi Janice Chi, great question.

    Migrating big tables can be both fun and painful at the same time, especially when you don't have an obvious partition column.

    For Azure Data Factory, look at the 'data partitioning' features in the docs; they cover some smart ways to handle big extracts. If you have any numeric or date column, even an imperfect one, try using that first. Also check the 'parallel copy' documentation, it works wonders for speeding things up.
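    The same range-based idea works on the Spark side. Below is a minimal sketch of a range-partitioned JDBC read against DB2; the host, credentials, table, and ORDER_ID column are hypothetical placeholders, it assumes the IBM DB2 JDBC driver is available on the cluster, and in a real run the bounds would come from your control table.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("db2-partitioned-extract").getOrCreate()

        # Hypothetical connection details -- replace with your own.
        jdbc_url = "jdbc:db2://db2-host:50000/SALESDB"
        table = "SCHEMA1.BIG_TABLE"

        # Range-partitioned read: Spark issues numPartitions parallel queries,
        # each covering one slice of ORDER_ID between lowerBound and upperBound.
        df = (
            spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", table)
            .option("user", "db2user")
            .option("password", "***")
            .option("driver", "com.ibm.db2.jcc.DB2Driver")
            .option("partitionColumn", "ORDER_ID")   # any roughly uniform numeric/date column
            .option("lowerBound", "1")
            .option("upperBound", "500000000")
            .option("numPartitions", "64")
            .load()
        )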

    No good timestamp? No problem. Try using row estimates or even artificial ranges: you can create a derived column that splits the data into chunks based on row count. Spark does this well with its partitioning strategies. Here's a tiny Spark example you might find useful: df.repartition(100, col("some_numeric_column"))
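    One way to manufacture artificial slices, assuming the table has some integer key (the ROW_KEY column below is hypothetical), is to pass explicit MOD-based predicates to the JDBC reader so each slice can be extracted, checkpointed, and retried independently. This reuses spark, jdbc_url, and table from the sketch above.

        # One predicate per slice; each slice maps to one row in the control table
        # and can be re-run on its own if an extract fails.
        num_slices = 32
        predicates = [f"MOD(ROW_KEY, {num_slices}) = {i}" for i in range(num_slices)]

        props = {"user": "db2user", "password": "***",
                 "driver": "com.ibm.db2.jcc.DB2Driver"}

        sliced_df = spark.read.jdbc(url=jdbc_url, table=table,
                                    predicates=predicates, properties=props)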

    Now for the universal tricks that work outside Azure too: look for columns with high cardinality, but not too crazy. Think customer IDs or order numbers. And sometimes ugly solutions work best; if nothing else fits, just hash a text column and split by that. It's not elegant, but it gets the job done.
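    A minimal sketch of that hashing trick on the Spark side, assuming a text column (CUSTOMER_CODE is a made-up name) and the df loaded earlier: derive a stable bucket number and drive the extract/checkpoint loop off it.

        from pyspark.sql.functions import abs as sql_abs, col, hash as sql_hash

        # Stable bucket number derived from a text column; the same row always
        # lands in the same bucket, which keeps re-runs deterministic.
        num_buckets = 50
        bucketed = df.withColumn(
            "extract_bucket",
            sql_abs(sql_hash(col("CUSTOMER_CODE"))) % num_buckets
        )

        # Process one bucket at a time and record completion in the control table.
        for b in range(num_buckets):
            chunk = bucketed.filter(col("extract_bucket") == b)
            # chunk.write.mode("overwrite").parquet(f"abfss://container@account.dfs.core.windows.net/big_table/bucket={b}")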

    It's worth looking into DB2's own system catalog tables too. They often hide a goldmine of metadata you can use for partitioning decisions; every database keeps statistics about its data distribution.
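    For example, on DB2 LUW the SYSCAT.COLUMNS catalog view exposes per-column cardinality (COLCARD) and rough value bounds (LOW2KEY/HIGH2KEY); DB2 for z/OS has a similar catalog under SYSIBM. A sketch that pulls those stats through the same JDBC connection, with the schema and table names as placeholders:

        # Rank candidate partition columns by estimated distinct-value count.
        stats_query = """
            (SELECT COLNAME, COLCARD, LOW2KEY, HIGH2KEY
               FROM SYSCAT.COLUMNS
              WHERE TABSCHEMA = 'SCHEMA1' AND TABNAME = 'BIG_TABLE'
              ORDER BY COLCARD DESC) AS col_stats
        """

        col_stats = (spark.read.format("jdbc")
                     .option("url", jdbc_url)
                     .option("dbtable", stats_query)
                     .option("user", "db2user")
                     .option("password", "***")
                     .option("driver", "com.ibm.db2.jcc.DB2Driver")
                     .load())
        col_stats.show(truncate=False)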

    Microsoft also has a 'partition options' feature in ADF that can auto-detect some patterns for you; check the 'optimize performance' guide, it has neat examples. And remember, sometimes brute force works: split by primary key ranges if you must.
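    If you go the primary-key-range route, here is a small sketch for seeding the control table: query MIN/MAX of the key once (ORDER_ID is again a placeholder), then cut the range into evenly sized slices, one per extract run.

        # MIN/MAX of the key, fetched through the same JDBC connection.
        bounds_query = f"(SELECT MIN(ORDER_ID) AS lo_key, MAX(ORDER_ID) AS hi_key FROM {table}) AS b"
        lo, hi = (spark.read.format("jdbc")
                  .option("url", jdbc_url)
                  .option("dbtable", bounds_query)
                  .option("user", "db2user")
                  .option("password", "***")
                  .option("driver", "com.ibm.db2.jcc.DB2Driver")
                  .load()
                  .first())

        num_slices = 64
        step = (hi - lo) // num_slices + 1
        key_ranges = [(lo + i * step, lo + (i + 1) * step) for i in range(num_slices)]
        # Each (start, end) pair becomes one row in the ingestion control table.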

    This might help in other tools too: when in doubt, sample first. Grab about 1% of the data, analyze the distribution, then plan the partitions; it saves tons of time versus guessing. Also check whether your source DB has any native export tools, DB2 might have smarter ways to chunk data than we realize.
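    A quick sketch of that sample-then-plan step in Spark, reusing df from above: approximate quantiles of a candidate column give cut points that balance the slices by row count rather than by value range.

        # Sample roughly 1% of the rows, then derive partition boundaries.
        sample = df.sample(fraction=0.01, seed=42)

        num_slices = 16
        probs = [i / num_slices for i in range(1, num_slices)]
        boundaries = sample.approxQuantile("ORDER_ID", probs, 0.001)
        # 'boundaries' holds 15 cut points splitting ORDER_ID into 16 roughly equal slices.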

    P.S. Azure Synapse has some slick partitioning helpers if you end up going that route. Its docs on 'partitioning strategies for ETL' are worth a quick peek.

    Good luck with the migration! It sounds like you're building something solid. Hit me up if any part needs more detail.

    Best regards,

    Alex

    and "yes" if you would follow me at Q&A - personaly thx.
    P.S. If my answer help to you, please Accept my answer
    PPS That is my Answer and not a Comment
    

    https://ctrlaltdel.blog/

