Hi @Janice Chi
As I understand it, you're dealing with a complex but very common pattern in large-scale CDC pipelines. Below are suggestions to address your three questions:
Do we need to process each topic individually?
No, not manually. While each Kafka topic has a distinct schema, you don’t need 800 separate notebooks. Instead, you can implement a metadata-driven framework using your control table to drive logic dynamically. Each topic can be processed using a generic notebook or pipeline that:
- Reads parameters from the control table (e.g., topic name, source/target, schema details, offset/checkpoint info)
- Dynamically applies schema and logic using Spark's capabilities (such as `from_json`, Delta `MERGE`, etc.)
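For illustration, here is a minimal sketch of that metadata-driven loop, assuming the Databricks notebook's built-in `spark` session and a Delta control table named `cdc_control` with columns such as `topic`, `schema_json`, and `bronze_table` (all hypothetical names, not a fixed contract):

```python
import json

from pyspark.sql import functions as F
from pyspark.sql.types import StructType

# One row per Kafka topic: schema, targets, offsets, etc. (hypothetical table/columns)
control_rows = spark.table("cdc_control").collect()

for row in control_rows:
    # Rebuild the topic-specific schema from the JSON definition kept in the control table
    topic_schema = StructType.fromJson(json.loads(row["schema_json"]))

    parsed_df = (
        spark.read.table(row["bronze_table"])          # raw Kafka payloads landed earlier
        .withColumn("payload", F.from_json(F.col("value").cast("string"), topic_schema))
        .select("payload.*")
    )
    # ...the rest of the Bronze/Silver logic stays identical for every topic
```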
Best practice for scalable and maintainable catch-up CDC (offset-based)?
For catch-up mode (after historical load), here are some best practices:
- Read each topic as a bounded batch from Kafka (`spark.read.format("kafka")`), setting `startingOffsets` and `endingOffsets` from your control table (note that `endingOffsets` is only honored by batch reads, not by `readStream`)
- Dynamically generate the schema from a schema registry or the embedded schema (if available), or predefine schemas in your metadata table
- Use a single notebook that loops through the control table entries and processes topics in parallel using `concurrent.futures`, multithreading, or Databricks Workflows with task parallelism
- Store the raw flattened JSON in Bronze, then apply consistent logic in Silver using `MERGE` driven by metadata (e.g., primary keys, op type)
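As a rough sketch of the catch-up pattern, the function below reads one topic as a bounded batch using offsets from the control table and merges it into Silver. Column names such as `bootstrap_servers`, `start_offsets_json`, `end_offsets_json`, `primary_keys`, and `silver_table` are assumptions for illustration:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def catch_up_topic(row, topic_schema):
    """Offset-bounded catch-up for a single topic, driven by one control-table row."""
    raw_df = (
        spark.read.format("kafka")                                # batch read, not readStream
        .option("kafka.bootstrap.servers", row["bootstrap_servers"])
        .option("subscribe", row["topic"])
        .option("startingOffsets", row["start_offsets_json"])     # e.g. '{"my.topic":{"0":1234}}'
        .option("endingOffsets", row["end_offsets_json"])         # e.g. '{"my.topic":{"0":5678}}'
        .load()
    )

    parsed_df = (
        raw_df
        .select(F.from_json(F.col("value").cast("string"), topic_schema).alias("r"))
        .select("r.*")
    )

    # Metadata-driven MERGE into Silver using the topic's primary keys
    merge_condition = " AND ".join(
        f"t.{key} = s.{key}" for key in row["primary_keys"].split(",")
    )
    (
        DeltaTable.forName(spark, row["silver_table"])
        .alias("t")
        .merge(parsed_df.alias("s"), merge_condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```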
Best practice for scalable NRT (watermark/checkpoint-based)?
In Near Real-Time mode:
- Use Structured Streaming with checkpointing and watermarking
- Again, drive everything from metadata - let one notebook or job process topics dynamically
- Partition your load by business unit or ___domain to optimize performance (e.g., 10–15 topics per cluster/job)
- Depending on latency needs, consider trigger-once / `Trigger.AvailableNow` scheduled micro-batch runs instead of an always-on stream (Auto Loader applies if the source lands as files rather than Kafka)
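A minimal NRT sketch is below, assuming `topic`, `bootstrap_servers`, `topic_schema`, `checkpoint_path`, and a shared `merge_to_silver()` helper are supplied per topic from the control table (all hypothetical names):

```python
from pyspark.sql import functions as F

def upsert_microbatch(batch_df, batch_id):
    # Reuse the same metadata-driven MERGE logic as the catch-up path
    merge_to_silver(batch_df)                                     # hypothetical shared helper

parsed_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", bootstrap_servers)
    .option("subscribe", topic)
    .option("startingOffsets", "latest")
    .load()
    .select(
        F.from_json(F.col("value").cast("string"), topic_schema).alias("r"),
        "timestamp",
    )
    .select("r.*", "timestamp")
    # Watermark bounds state if you deduplicate or aggregate on event time downstream
    .withWatermark("timestamp", "10 minutes")
)

(
    parsed_stream.writeStream
    .foreachBatch(upsert_microbatch)
    .option("checkpointLocation", checkpoint_path)                # one checkpoint per topic
    .trigger(processingTime="1 minute")                           # or availableNow=True for scheduled runs
    .start()
)
```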
Operational Efficiency Recommendations
- Avoid 800 notebooks - aim for 1-2 parameterized notebooks for Bronze/Silver layers
- Maintain separate cluster pools or use autoscaling for different topic groups to parallelize
- Monitor pipeline health using logging in the control table (e.g., last processed offset, status, error message); see the sketch after this list
- Use ADF pipelines or Databricks Workflows to orchestrate batches by group.
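For the control-table logging, a simple upsert per run is usually enough. The sketch below assumes the same hypothetical `cdc_control` Delta table with `topic`, `last_offsets_json`, `status`, `error_message`, and `updated_at` columns:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def log_run_status(topic, last_offsets_json, status, error_message=None):
    """Record the outcome of a topic run back into the control table."""
    update_df = spark.createDataFrame(
        [(topic, last_offsets_json, status, error_message)],
        "topic string, last_offsets_json string, status string, error_message string",
    ).withColumn("updated_at", F.current_timestamp())

    (
        DeltaTable.forName(spark, "cdc_control")
        .alias("t")
        .merge(update_df.alias("s"), "t.topic = s.topic")
        .whenMatchedUpdate(set={
            "last_offsets_json": "s.last_offsets_json",
            "status": "s.status",
            "error_message": "s.error_message",
            "updated_at": "s.updated_at",
        })
        .execute()
    )
```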
Optimization Tips for Schema Variation, Parallelism, and Operational Efficiency
- Schema Handling: Define the schema ___location/type in the control table. You can use:
  - JSON schema stored in ADLS per topic
  - Auto schema inference with constraints
  - Confluent Schema Registry (if used)
- Parallelism:
  - Group topics logically and process them in batches via ADF or Databricks job clusters (see the sketch after this list)
  - Use cluster pools or autoscaling to manage resource cost and execution time
- Operational Efficiency:
  - Track processing status (offsets, run status, errors) in the control table
  - Enable logging and alerting for failed topic runs
  - Monitor via Databricks job runs or custom dashboards
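For the parallelism point, a thread pool on the driver is typically sufficient because the heavy work runs on the Spark cluster. In the sketch below, `process_topic()` stands in for the generic per-topic routine shown earlier, `log_run_status()` is the logging helper from the previous sketch, and `topic_group` is an assumed grouping column:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# One logical group of topics per job/cluster (e.g., 10-15 topics)
group_rows = (
    spark.table("cdc_control")
    .where("topic_group = 'finance'")
    .collect()
)

# Threads (not processes) are enough here: each submit() just drives Spark jobs on the cluster.
# Cap max_workers so you don't flood the cluster with concurrent jobs.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(process_topic, row): row["topic"] for row in group_rows}
    for future in as_completed(futures):
        topic = futures[future]
        try:
            future.result()
            log_run_status(topic, last_offsets_json=None, status="SUCCEEDED")
        except Exception as exc:
            log_run_status(topic, last_offsets_json=None, status="FAILED", error_message=str(exc))
```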
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.