How to determine partition column for large table extraction from source systems during Azure-based data migration?

Janice Chi 100 Reputation points
2025-06-10T07:48:15.46+00:00

In our data migration project (source: IBM DB2 → Azure), we are designing a control-table-based ingestion framework to handle large source tables (10–1500 GB range). For performance and checkpointing, each table is split into partitions, and data is extracted partition-wise.

Our challenge is identifying the right partition column for such large source tables when:

The natural partitioning column (e.g., last_updated_ts) is missing or unreliable

Table size is large and needs to be split across multiple extract runs

We want to keep the extraction scalable, checkpoint-safe, and resumable

Can Microsoft recommend best practices or guidelines for:

1. Identifying ideal partitioning columns when metadata is limited

2. Designing partition logic for high-volume extract pipelines (using ADF or Spark)

3. Handling cases where no datetime or numeric column exists for natural slicing

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. Alex Burlachenko 8,315 Reputation points
    2025-06-10T08:22:08.54+00:00

    Hi Janice Chi, great question.

    Migrating big tables can be both fun and painful at the same time, especially when you don't have an obvious partition column.

    For Azure Data Factory, look at the 'data partitioning' features in the docs; they cover some smart ways to handle big extracts. If you have any numeric or date column, even an imperfect one, try using that first. Also check the 'parallel copy' documentation, it works wonders for speeding things up.
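    The same range-based idea works on the Spark side. Below is a minimal sketch of a range-partitioned JDBC read against DB2; the host, credentials, table, and ORDER_ID column are hypothetical placeholders, it assumes the IBM DB2 JDBC driver is available on the cluster, and in a real run the bounds would come from your control table.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("db2-partitioned-extract").getOrCreate()

        # Hypothetical connection details -- replace with your own.
        jdbc_url = "jdbc:db2://db2-host:50000/SALESDB"
        table = "SCHEMA1.BIG_TABLE"

        # Range-partitioned read: Spark issues numPartitions parallel queries,
        # each covering one slice of ORDER_ID between lowerBound and upperBound.
        df = (
            spark.read.format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", table)
            .option("user", "db2user")
            .option("password", "***")
            .option("driver", "com.ibm.db2.jcc.DB2Driver")
            .option("partitionColumn", "ORDER_ID")   # any roughly uniform numeric/date column
            .option("lowerBound", "1")
            .option("upperBound", "500000000")
            .option("numPartitions", "64")
            .load()
        )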

    No good timestamp? No problem. Try using row estimates or even artificial ranges: you can create a derived column that splits the data into chunks based on row count. Spark does this well with its partitioning strategies. Here's a tiny Spark example you might find useful: df.repartition(100, col("some_numeric_column"))
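    One way to manufacture artificial slices, assuming the table has some integer key (the ROW_KEY column below is hypothetical), is to pass explicit MOD-based predicates to the JDBC reader so each slice can be extracted, checkpointed, and retried independently. This reuses spark, jdbc_url, and table from the sketch above.

        # One predicate per slice; each slice maps to one row in the control table
        # and can be re-run on its own if an extract fails.
        num_slices = 32
        predicates = [f"MOD(ROW_KEY, {num_slices}) = {i}" for i in range(num_slices)]

        props = {"user": "db2user", "password": "***",
                 "driver": "com.ibm.db2.jcc.DB2Driver"}

        sliced_df = spark.read.jdbc(url=jdbc_url, table=table,
                                    predicates=predicates, properties=props)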

    Now for the universal tricks that work outside Azure too: look for columns with high cardinality, but not too crazy. Think customer IDs or order numbers. And sometimes ugly solutions work best; if nothing else fits, just hash a text column and split by that. It's not elegant, but it gets the job done.
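    A minimal sketch of that hashing trick on the Spark side, assuming a text column (CUSTOMER_CODE is a made-up name) and the df loaded earlier: derive a stable bucket number and drive the extract/checkpoint loop off it.

        from pyspark.sql.functions import abs as sql_abs, col, hash as sql_hash

        # Stable bucket number derived from a text column; the same row always
        # lands in the same bucket, which keeps re-runs deterministic.
        num_buckets = 50
        bucketed = df.withColumn(
            "extract_bucket",
            sql_abs(sql_hash(col("CUSTOMER_CODE"))) % num_buckets
        )

        # Process one bucket at a time and record completion in the control table.
        for b in range(num_buckets):
            chunk = bucketed.filter(col("extract_bucket") == b)
            # chunk.write.mode("overwrite").parquet(f"abfss://container@account.dfs.core.windows.net/big_table/bucket={b}")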

    It's worth looking into DB2's own system catalog tables too. They often hide a goldmine of metadata you can use for partitioning decisions; every database keeps statistics about its data distribution.
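    For example, on DB2 LUW the SYSCAT.COLUMNS catalog view exposes per-column cardinality (COLCARD) and rough value bounds (LOW2KEY/HIGH2KEY); DB2 for z/OS has a similar catalog under SYSIBM. A sketch that pulls those stats through the same JDBC connection, with the schema and table names as placeholders:

        # Rank candidate partition columns by estimated distinct-value count.
        stats_query = """
            (SELECT COLNAME, COLCARD, LOW2KEY, HIGH2KEY
               FROM SYSCAT.COLUMNS
              WHERE TABSCHEMA = 'SCHEMA1' AND TABNAME = 'BIG_TABLE'
              ORDER BY COLCARD DESC) AS col_stats
        """

        col_stats = (spark.read.format("jdbc")
                     .option("url", jdbc_url)
                     .option("dbtable", stats_query)
                     .option("user", "db2user")
                     .option("password", "***")
                     .option("driver", "com.ibm.db2.jcc.DB2Driver")
                     .load())
        col_stats.show(truncate=False)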

    Microsoft also has a 'partition options' feature in ADF that can auto-detect some patterns for you; check the 'optimize performance' guide, it has neat examples. And remember, sometimes brute force works: split by primary key ranges if you must.
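    If you go the primary-key-range route, here is a small sketch for seeding the control table: query MIN/MAX of the key once (ORDER_ID is again a placeholder), then cut the range into evenly sized slices, one per extract run.

        # MIN/MAX of the key, fetched through the same JDBC connection.
        bounds_query = f"(SELECT MIN(ORDER_ID) AS lo_key, MAX(ORDER_ID) AS hi_key FROM {table}) AS b"
        lo, hi = (spark.read.format("jdbc")
                  .option("url", jdbc_url)
                  .option("dbtable", bounds_query)
                  .option("user", "db2user")
                  .option("password", "***")
                  .option("driver", "com.ibm.db2.jcc.DB2Driver")
                  .load()
                  .first())

        num_slices = 64
        step = (hi - lo) // num_slices + 1
        key_ranges = [(lo + i * step, lo + (i + 1) * step) for i in range(num_slices)]
        # Each (start, end) pair becomes one row in the ingestion control table.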

    This might help in other tools too: when in doubt, sample first. Grab about 1% of the data, analyze the distribution, then plan the partitions; it saves tons of time versus guessing. Also check whether your source DB has any native export tools, DB2 might have smarter ways to chunk data than we realize.
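    A quick sketch of that sample-then-plan step in Spark, reusing df from above: approximate quantiles of a candidate column give cut points that balance the slices by row count rather than by value range.

        # Sample roughly 1% of the rows, then derive partition boundaries.
        sample = df.sample(fraction=0.01, seed=42)

        num_slices = 16
        probs = [i / num_slices for i in range(1, num_slices)]
        boundaries = sample.approxQuantile("ORDER_ID", probs, 0.001)
        # 'boundaries' holds 15 cut points splitting ORDER_ID into 16 roughly equal slices.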

    P.S. Azure Synapse has some slick partitioning helpers if you end up going that route. Its docs on 'partitioning strategies for ETL' are worth a quick peek.

    Good luck with the migration! It sounds like you're building something solid. Hit me up if any part needs more detail.

    Best regards,

    Alex

    and "yes" if you would follow me at Q&A - personaly thx.
    P.S. If my answer help to you, please Accept my answer
    PPS That is my Answer and not a Comment
    

    https://ctrlaltdel.blog/

