Hi @Janice Chi
As I understand it, you're dealing with a complex but very common pattern in large-scale CDC pipelines. Below are suggestions to address your three questions:
Do we need to process each topic individually?
No, not manually. While each Kafka topic has a distinct schema, you don’t need 800 separate notebooks. Instead, you can implement a metadata-driven framework using your control table to drive logic dynamically. Each topic can be processed using a generic notebook or pipeline that:
- Reads parameters from the control table (e.g., topic name, source/target, schema details, offset/checkpoint info)
- Dynamically applies schema and logic using Spark's capabilities (such as `from_json`, Delta `MERGE`, etc.)
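For illustration, here is a minimal sketch of that metadata-driven loop, assuming the Databricks notebook's built-in `spark` session and a Delta control table named `cdc_control` with columns such as `topic`, `schema_json`, and `bronze_table` (all hypothetical names, not a fixed contract):

```python
import json

from pyspark.sql import functions as F
from pyspark.sql.types import StructType

# One row per Kafka topic: schema, targets, offsets, etc. (hypothetical table/columns)
control_rows = spark.table("cdc_control").collect()

for row in control_rows:
    # Rebuild the topic-specific schema from the JSON definition kept in the control table
    topic_schema = StructType.fromJson(json.loads(row["schema_json"]))

    parsed_df = (
        spark.read.table(row["bronze_table"])          # raw Kafka payloads landed earlier
        .withColumn("payload", F.from_json(F.col("value").cast("string"), topic_schema))
        .select("payload.*")
    )
    # ...the rest of the Bronze/Silver logic stays identical for every topic
```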
Best practice for scalable and maintainable catch-up CDC (offset-based)?
For catch-up mode (after historical load), here are some best practices:
- Read each topic as a bounded batch from Kafka (`spark.read.format("kafka")`), setting `startingOffsets` and `endingOffsets` from your control table (note that `endingOffsets` is only honored by batch reads, not by `readStream`)
- Dynamically generate the schema from a schema registry or the embedded schema (if available), or predefine schemas in your metadata table
- Use a single notebook that loops through the control table entries and processes topics in parallel using `concurrent.futures`, multithreading, or Databricks Workflows with task parallelism
- Store the raw flattened JSON in Bronze, then apply consistent logic in Silver using `MERGE` driven by metadata (e.g., primary keys, op type)
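As a rough sketch of the catch-up pattern, the function below reads one topic as a bounded batch using offsets from the control table and merges it into Silver. Column names such as `bootstrap_servers`, `start_offsets_json`, `end_offsets_json`, `primary_keys`, and `silver_table` are assumptions for illustration:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def catch_up_topic(row, topic_schema):
    """Offset-bounded catch-up for a single topic, driven by one control-table row."""
    raw_df = (
        spark.read.format("kafka")                                # batch read, not readStream
        .option("kafka.bootstrap.servers", row["bootstrap_servers"])
        .option("subscribe", row["topic"])
        .option("startingOffsets", row["start_offsets_json"])     # e.g. '{"my.topic":{"0":1234}}'
        .option("endingOffsets", row["end_offsets_json"])         # e.g. '{"my.topic":{"0":5678}}'
        .load()
    )

    parsed_df = (
        raw_df
        .select(F.from_json(F.col("value").cast("string"), topic_schema).alias("r"))
        .select("r.*")
    )

    # Metadata-driven MERGE into Silver using the topic's primary keys
    merge_condition = " AND ".join(
        f"t.{key} = s.{key}" for key in row["primary_keys"].split(",")
    )
    (
        DeltaTable.forName(spark, row["silver_table"])
        .alias("t")
        .merge(parsed_df.alias("s"), merge_condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```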
Best practice for scalable NRT (watermark/checkpoint-based)?
In Near Real-Time mode:
- Use Structured Streaming with checkpointing and watermarking
- Again, drive everything from metadata - let one notebook or job process topics dynamically
- Partition your load by business unit or ___domain to optimize performance (e.g., 10–15 topics per cluster/job)
- Depending on latency needs, consider trigger-once / `Trigger.AvailableNow` scheduled micro-batch runs instead of an always-on stream (Auto Loader applies if the source lands as files rather than Kafka)
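A minimal NRT sketch is below, assuming `topic`, `bootstrap_servers`, `topic_schema`, `checkpoint_path`, and a shared `merge_to_silver()` helper are supplied per topic from the control table (all hypothetical names):

```python
from pyspark.sql import functions as F

def upsert_microbatch(batch_df, batch_id):
    # Reuse the same metadata-driven MERGE logic as the catch-up path
    merge_to_silver(batch_df)                                     # hypothetical shared helper

parsed_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("subscribe", topic)
    .option("startingOffsets", "latest")
    .load()
    .select(
        F.from_json(F.col("value").cast("string"), topic_schema).alias("r"),
        "timestamp",
    )
    .select("r.*", "timestamp")
    # Watermark bounds state if you deduplicate or aggregate on event time downstream
    .withWatermark("timestamp", "10 minutes")
)

(
    parsed_stream.writeStream
    .foreachBatch(upsert_microbatch)
    .option("checkpointLocation", checkpoint_path)                # one checkpoint per topic
    .trigger(processingTime="1 minute")                           # or availableNow=True for scheduled runs
    .start()
)
```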
Operational Efficiency Recommendations
- Avoid 800 notebooks - aim for 1-2 parameterized notebooks for Bronze/Silver layers
- Maintain separate cluster pools or use autoscaling for different topic groups to parallelize
- Monitor pipeline health using logging in the control table (e.g., last processed offset, status, error message); see the sketch after this list
- Use ADF pipelines or Databricks Workflows to orchestrate batches by group.
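For the control-table logging, a simple upsert per run is usually enough. The sketch below assumes the same hypothetical `cdc_control` Delta table with `topic`, `last_offsets_json`, `status`, `error_message`, and `updated_at` columns:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def log_run_status(topic, last_offsets_json, status, error_message=None):
    """Record the outcome of a topic run back into the control table."""
    update_df = spark.createDataFrame(
        [(topic, last_offsets_json, status, error_message)],
        "topic string, last_offsets_json string, status string, error_message string",
    ).withColumn("updated_at", F.current_timestamp())

    (
        DeltaTable.forName(spark, "cdc_control")
        .alias("t")
        .merge(update_df.alias("s"), "t.topic = s.topic")
        .whenMatchedUpdate(set={
            "last_offsets_json": "s.last_offsets_json",
            "status": "s.status",
            "error_message": "s.error_message",
            "updated_at": "s.updated_at",
        })
        .execute()
    )
```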
Optimization Tips for Schema Variation, Parallelism, and Operational Efficiency
- Schema Handling: Define the schema ___location/type in the control table. You can use:
  - JSON schema stored in ADLS per topic
  - Auto schema inference with constraints
  - Confluent Schema Registry (if used)
- Parallelism:
  - Group topics logically and process them in batches via ADF or Databricks job clusters (see the sketch after this list)
  - Use cluster pools or autoscaling to manage resource cost and execution time
- Operational Efficiency:
  - Track processing status (offsets, run status, errors) in the control table
  - Enable logging and alerting for failed topic runs
  - Monitor via Databricks job runs or custom dashboards
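For the parallelism point, a thread pool on the driver is typically sufficient because the heavy work runs on the Spark cluster. In the sketch below, `process_topic()` stands in for the generic per-topic routine shown earlier, `log_run_status()` is the logging helper from the previous sketch, and `topic_group` is an assumed grouping column:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# One logical group of topics per job/cluster (e.g., 10-15 topics)
group_rows = (
    spark.table("cdc_control")
    .where("topic_group = 'finance'")
    .collect()
)

# Threads (not processes) are enough here: each submit() just drives Spark jobs on the cluster.
# Cap max_workers so you don't flood the cluster with concurrent jobs.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(process_topic, row): row["topic"] for row in group_rows}
    for future in as_completed(futures):
        topic = futures[future]
        try:
            future.result()
            log_run_status(topic, last_offsets_json=None, status="SUCCEEDED")
        except Exception as exc:
            log_run_status(topic, last_offsets_json=None, status="FAILED", error_message=str(exc))
```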
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.