Processing 800 Kafka topics with Databricks (DBR)

Janice Chi 100 Reputation points
2025-06-11T07:06:44.0833333+00:00

We are working on a large-scale Change Data Capture (CDC) implementation where: the source system is IBM DB2, IBM InfoSphere CDC pushes changes to Kafka with each Kafka topic representing one DB2 table, and there are 800 Kafka topics in total.

Please note that we have two stages: one is catch-up (offset-based, after the historical load), and then we have NRT (near real-time), which will be watermark/checkpoint based.

Q1. Do we need to process each topic individually?

The target architecture includes a Bronze layer (raw data) on ADLS, where we store flattened JSON data, and a Silver layer (processed data) with MERGE-based logic (insert/update/delete).

Databricks is used for all transformations and processing. Azure Data Factory orchestrates workflows. Each Kafka topic has a distinct schema, and a control table maintains metadata for all topics (e.g., business unit, topic name, target table, Start offset, End offset, primary keys, etc.).

❓ Question to Microsoft: Given this setup, Q2. What is the best practice for implementing a scalable and maintainable CDC framework across 800 Kafka topics with distinct schemas for the catch-up stage (offset-based, after the historical load)?

Q3. And what is the best practice for the same framework in the NRT stage, which will be watermark/checkpoint based?

Specifically: Should we create 800 individual notebooks for Bronze (Kafka-to-ADLS) and another 800 for Silver (MERGE logic)? Or is there a recommended pattern to process each topic/table generically?

How can we optimize this pipeline for schema variation, parallelism, and operational efficiency — particularly for catch-up loads across all 800 topics?


1 answer

  1. Smaran Thoomu 24,095 Reputation points Microsoft External Staff Moderator
    2025-06-11T08:04:03.1533333+00:00

    Hi @Janice Chi
    As I understand it, you're dealing with a complex but very common pattern in large-scale CDC pipelines. Below are suggestions that address your three questions:

    Do we need to process each topic individually?

    No, not manually. While each Kafka topic has a distinct schema, you don’t need 800 separate notebooks. Instead, you can implement a metadata-driven framework using your control table to drive logic dynamically. Each topic can be processed using a generic notebook or pipeline that:

    • Reads parameters from the control table (e.g., topic name, source/target, schema details, offset/checkpoint info)
    • Dynamically applies schema and logic using Spark’s capabilities (like from_json, merge, etc.)
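
    For illustration, here is a minimal sketch of such a generic, metadata-driven Bronze notebook. All table and column names (cdc_control, schema_json, bronze_path) and the broker placeholder are assumptions about your metadata layout, not fixed conventions:

```python
# Sketch of a single parameterized Bronze notebook driven by the control table.
# Assumed: a Delta control table `cdc_control` with one row per topic, whose
# `schema_json` column holds the topic's schema as a Spark schema JSON string.
import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# The topic to process arrives as a notebook/job parameter
# (dbutils is available inside Databricks notebooks).
topic_name = dbutils.widgets.get("topic_name")

# 1. Look up this topic's metadata row in the control table.
ctl = spark.table("cdc_control").filter(col("topic_name") == topic_name).first()

# 2. Rebuild the topic-specific schema from its JSON definition.
schema = StructType.fromJson(json.loads(ctl["schema_json"]))

# 3. Read the raw Kafka messages for this topic (offset handling is shown
#    in the catch-up sketch further down).
raw = (spark.read.format("kafka")
       .option("kafka.bootstrap.servers", "<broker-list>")   # placeholder
       .option("subscribe", topic_name)
       .load())

# 4. Apply the schema dynamically and land the flattened records in Bronze.
flattened = (raw
             .select(from_json(col("value").cast("string"), schema).alias("r"))
             .select("r.*"))
flattened.write.format("delta").mode("append").save(ctl["bronze_path"])
```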

    Best practice for scalable and maintainable catch-up CDC (offset-based)?

    For catch-up mode (after historical load), here are some best practices:

    • For the Kafka read in catch-up, use bounded batch queries (spark.read with the kafka source) and set startingOffsets and endingOffsets from your control table; endingOffsets is honored only by batch queries, which makes them a natural fit for this stage
    • Dynamically generate the schema from schema registry or embedded schema (if available), or predefine schemas in your metadata table
    • Use a single notebook that loops through the control table entries, processes each topic in parallel using concurrent.futures, multithreading, or Databricks workflows with task parallelism
    • Store raw flattened JSON to Bronze, then apply consistent logic in Silver using MERGE driven by metadata (e.g., primary keys, op type)
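
    Tying those points together, here is a hedged sketch of the catch-up loop. The offset columns (start_offset/end_offset holding Kafka offset-range JSON), silver_table, primary_keys (comma-separated), and the op flag in the payload are all assumptions about your control table and CDC format:

```python
# Catch-up sketch: bounded Kafka reads per topic, MERGE into Silver, topics run in parallel.
# Everything named here (cdc_control, start_offset, end_offset, silver_table,
# primary_keys, the `op` flag) is an assumption about your metadata.
import json
from concurrent.futures import ThreadPoolExecutor

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

def catch_up_topic(ctl_row):
    schema = StructType.fromJson(json.loads(ctl_row["schema_json"]))

    # Bounded batch read: endingOffsets is honored only by batch queries,
    # which is exactly what catch-up needs.
    raw = (spark.read.format("kafka")
           .option("kafka.bootstrap.servers", "<broker-list>")    # placeholder
           .option("subscribe", ctl_row["topic_name"])
           .option("startingOffsets", ctl_row["start_offset"])    # e.g. {"orders":{"0":12345}}
           .option("endingOffsets", ctl_row["end_offset"])
           .load())

    updates = (raw
               .select(from_json(col("value").cast("string"), schema).alias("r"))
               .select("r.*"))

    # Build the MERGE condition from the primary keys stored in the control table.
    keys = [k.strip() for k in ctl_row["primary_keys"].split(",")]
    cond = " AND ".join(f"t.{k} = s.{k}" for k in keys)

    (DeltaTable.forName(spark, ctl_row["silver_table"]).alias("t")
        .merge(updates.alias("s"), cond)
        .whenMatchedDelete(condition="s.op = 'D'")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll(condition="s.op != 'D'")
        .execute())

# Loop over the control table and process topics concurrently (tune max_workers
# to your cluster size, or split the work across Databricks Workflows tasks instead).
rows = spark.table("cdc_control").collect()
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(catch_up_topic, rows))
```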

    Best practice for scalable NRT (watermark/checkpoint-based)?

    In Near Real-Time mode:

    • Use Structured Streaming with checkpointing and watermarking
    • Again, drive everything from metadata - let one notebook or job process topics dynamically
    • Partition your load by business unit or ___domain to optimize performance (e.g., 10–15 topics per cluster/job)
    • Consider running Auto Loader over the Bronze files with Trigger.AvailableNow (the successor to trigger-once mode) for frequent NRT refreshes, depending on your latency needs
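
    As a sketch of the NRT path, reusing the control table from above; the checkpoint_path column and the merge_to_silver helper are assumed names, with the helper wrapping the same MERGE logic as in the catch-up sketch:

```python
# NRT sketch: one Structured Streaming query per topic, checkpoint-based and
# driven by the same control table. Names (checkpoint_path, merge_to_silver) are assumed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.getOrCreate()

def start_stream(ctl_row, schema):
    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "<broker-list>")   # placeholder
              .option("subscribe", ctl_row["topic_name"])
              .option("startingOffsets", "latest")
              .load())

    parsed = (stream
              .select(col("timestamp"),
                      from_json(col("value").cast("string"), schema).alias("r"))
              .select("timestamp", "r.*")
              # The watermark only matters if you add stateful steps
              # (e.g. dropDuplicates or windowed aggregations) before the sink.
              .withWatermark("timestamp", "10 minutes"))

    def merge_batch(batch_df, batch_id):
        merge_to_silver(batch_df, ctl_row)   # assumed helper wrapping the MERGE shown earlier

    return (parsed.writeStream
            .foreachBatch(merge_batch)
            .option("checkpointLocation", ctl_row["checkpoint_path"])  # one checkpoint per topic
            .trigger(availableNow=True)      # or processingTime="1 minute" for continuous NRT
            .start())
```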

    Operational Efficiency Recommendations

    • Avoid 800 notebooks - aim for 1-2 parameterized notebooks for Bronze/Silver layers
    • Maintain separate cluster pools or use autoscaling for different topic groups to parallelize
    • Monitor pipeline health using logging in the control table (e.g., last processed offset, status, error message)
    • Use ADF pipelines or Databricks Workflows to orchestrate batches by group.
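
    The logging piece could look roughly like this, assuming status columns such as last_offset, status, error_message, and last_run_ts on the same control table:

```python
# Sketch of run-status logging back into the control table after each topic run.
# Column names (last_offset, status, error_message, last_run_ts) are assumptions.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, lit

spark = SparkSession.builder.getOrCreate()

def record_run(topic_name, last_offset, status, error_message=None):
    (DeltaTable.forName(spark, "cdc_control")
        .update(
            condition=f"topic_name = '{topic_name}'",
            set={
                "last_offset": lit(last_offset).cast("string"),
                "status": lit(status),
                "error_message": lit(error_message).cast("string"),
                "last_run_ts": current_timestamp(),
            }))

# Typical usage around each topic (catch_up_topic is from the earlier sketch):
def run_with_logging(ctl_row):
    try:
        catch_up_topic(ctl_row)
        record_run(ctl_row["topic_name"], ctl_row["end_offset"], "SUCCEEDED")
    except Exception as e:
        record_run(ctl_row["topic_name"], ctl_row["start_offset"], "FAILED", str(e))
        raise
```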

    Optimization Tips for Schema Variation, Parallelism, and Operational Efficiency

    1. Schema Handling: Define the schema ___location/type in the control table (see the sketch after this list). You can use:
      • JSON schema stored in ADLS per topic
      • Auto schema inference with constraints
      • Confluent schema registry (if used)
    2. Parallelism:
      • Group topics logically and process in batches via ADF or Databricks job clusters
      • Use cluster pools or autoscaling to manage resource cost and execution time
    3. Operational Efficiency:
      • Track processing status (offsets, run status, errors) in the control table
      • Enable logging and alerting for failed topic runs
      • Monitor via Databricks job runs or custom dashboards
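
    For option 1 under Schema Handling (a JSON schema file in ADLS per topic), here is a small sketch of turning such a file back into a Spark schema; the schema_path value and the file layout (one StructType JSON per topic) are assumptions:

```python
# Sketch: rebuild a topic's Spark schema from a JSON schema file stored in ADLS.
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

def load_schema(schema_path):
    # Read the whole JSON file as a single string (works for abfss:// paths on ADLS).
    schema_str = spark.read.text(schema_path, wholetext=True).first()[0]
    return StructType.fromJson(json.loads(schema_str))

# Example (path is a placeholder):
# schema = load_schema("abfss://metadata@<storage-account>.dfs.core.windows.net/schemas/orders.json")
```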

    I hope this information helps. Please do let us know if you have any further queries.


    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

