Thanks for outlining your architecture and questions; you are addressing a very important aspect of scaling streaming pipelines for high-throughput use cases.
Based on your scenario (Databricks streaming from Kafka to Azure SQL Hyperscale), here are responses and guidance for each of your queries:
Does Kafka partition size impact execution in Databricks?
Yes, Kafka partition size directly impacts task execution and parallelism in Spark Structured Streaming. In Databricks:
By default, each Kafka partition maps to one Spark task per micro-batch.
If a single partition holds 5+ TB of data, every micro-batch funnels that partition's records through a single task, which can result in:
- Task skew, where one task takes significantly longer than others
- Executor memory pressure or potential OOM (Out of Memory) issues
- Slow checkpointing and longer batch intervals
This can negatively impact streaming stability and throughput.
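As an illustration, the Kafka source in Structured Streaming exposes a minPartitions option that asks Spark to split large Kafka partitions into more Spark input partitions, which can relieve skew from an oversized partition. A minimal sketch, assuming a Databricks notebook where spark is predefined; the broker addresses and topic name are placeholders:

```python
# Minimal sketch: read a Kafka topic and ask Spark to split the Kafka
# partitions into more Spark partitions per micro-batch to reduce task skew.
# Broker addresses and topic name are placeholders.
df = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder
        .option("subscribe", "orders")                                   # placeholder topic
        .option("startingOffsets", "latest")
        # Create at least this many Spark input partitions per micro-batch,
        # even if the topic itself has fewer Kafka partitions.
        .option("minPartitions", "128")
        .load()
)
```

This only helps Spark-side parallelism; the Kafka-side sizing guidance below still applies.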
Is there any documented best practice for Kafka partition size?
While there's no strict maximum size per Kafka partition published by Microsoft or Databricks, community best practices (including those from Databricks, Confluent, and Kafka maintainers) suggest:
- Keep Kafka partition sizes small (typically ~1 GB to a few GB per partition) to enable better parallelism and fault isolation.
- Avoid letting individual partitions grow into the multi-terabyte range.
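To see how these two guidelines interact at your scale, here is a rough back-of-the-envelope check; the retention and partition-count values are illustrative assumptions, not recommendations:

```python
# Rough sizing check: average data volume per Kafka partition for one topic.
# The retention and partition-count values are illustrative assumptions.
daily_ingest_gb = 10_000      # upper end of 5-10 TB/day per topic
retention_days = 3            # assumed topic retention
partitions = 200              # assumed partition count

gb_per_partition = daily_ingest_gb * retention_days / partitions
print(f"~{gb_per_partition:.0f} GB per partition")   # ~150 GB per partition
```

Even at 200 partitions, a few days of retention already puts each partition well beyond the low-GB guideline, so partition count and retention policy need to be tuned together.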
Recommended partitioning and tuning strategies
To support your ingestion volumes (5–10 TB/day/topic), here are some recommended strategies:
Increase Kafka Partition Count
- Ensure topics are sufficiently partitioned (ideally in the range of 50–200+ partitions for high-volume topics).
- This enables Databricks to process data in parallel across multiple tasks and executors.
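As a sketch, the partition count of an existing topic can be raised with the confluent-kafka Python AdminClient (an assumption on my part; kafka-topics.sh or your managed Kafka service's tooling works just as well). Broker and topic values are placeholders:

```python
# Sketch: grow an existing topic to 200 partitions in total.
# Note: partition counts can only be increased, never decreased.
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # placeholder brokers

futures = admin.create_partitions([NewPartitions("orders", 200)])  # placeholder topic
for topic, future in futures.items():
    future.result()  # raises if the broker rejects the request
```

Keep in mind that adding partitions changes how keyed messages map to partitions, so plan the change before ordering-sensitive consumers come to depend on a fixed key-to-partition assignment.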
Use maxOffsetsPerTrigger in Structured Streaming
- Helps limit how much data Spark reads from Kafka per micro-batch, preventing large spikes in memory/processing time.
- Tune this based on micro-batch latency targets and SQL Hyperscale write throughput.
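A sketch of how this looks on the read side; the cap of 5,000,000 records per micro-batch is an arbitrary starting point to tune against your latency targets and sink throughput, and the broker/topic values are placeholders:

```python
# Cap how many Kafka records Spark pulls per micro-batch so downstream writes
# to SQL Hyperscale see a steady, bounded load. Tune the cap empirically.
df = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder
        .option("subscribe", "orders")                                   # placeholder topic
        .option("maxOffsetsPerTrigger", "5000000")  # total records per micro-batch, across all partitions
        .load()
)
```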
Implement Size-Based Compaction or Retention in Kafka
- Helps prevent partitions from growing excessively over time.
- Can reduce noise and stale records when working with CDC-type messages.
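For illustration, topic retention and compaction settings can be adjusted with the confluent-kafka AdminClient (again an assumption; kafka-configs.sh or your platform's tooling is equivalent). The values below are illustrative, not recommendations:

```python
# Sketch: cap on-disk size per partition and enable compaction for CDC-style keys.
# Topic name, broker, and limits are placeholders; align them with replay needs.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # placeholder brokers

resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    "orders",  # placeholder topic
    set_config={
        "retention.bytes": str(50 * 1024**3),   # ~50 GB per partition (retention.bytes is per partition)
        "cleanup.policy": "compact,delete",     # compact by key, then delete by retention
    },
)
futures = admin.alter_configs([resource])
for res, future in futures.items():
    future.result()  # raises if the broker rejects the change
```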
Additional recommendations for Azure SQL Hyperscale as Sink
JDBC Sink Optimization:
- Use foreachBatch() to gain fine-grained control over how data is written (a sketch follows this list).
- Partition the DataFrame by a meaningful key (e.g., date, ID) before writing to enable parallel inserts.
- Tune JDBC options such as:
  - batchsize
  - numPartitions
  - Connection pool parameters
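A minimal sketch of the foreachBatch pattern, assuming df is the streaming DataFrame read from Kafka as in the earlier snippets and that the Microsoft SQL Server JDBC driver is available on the cluster; connection details, table and column names, and tuning values are placeholders:

```python
# Sketch: write each micro-batch to Azure SQL Hyperscale via JDBC.
# Connection details, table/column names, and tuning values are placeholders.
jdbc_url = (
    "jdbc:sqlserver://<server>.database.windows.net:1433;"
    "database=<db>;encrypt=true"
)
jdbc_props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "batchsize": "10000",   # rows per JDBC batch insert; tune empirically
}

def write_to_hyperscale(batch_df, batch_id):
    # Repartition by a meaningful key so inserts run over parallel connections,
    # but keep the partition count aligned with what Hyperscale can absorb.
    (batch_df
        .repartition(16, "event_date")          # hypothetical key column
        .write
        .mode("append")
        .jdbc(jdbc_url, "dbo.events_staging", properties=jdbc_props))  # placeholder table

query = (
    df.writeStream
      .foreachBatch(write_to_hyperscale)
      .option("checkpointLocation", "/mnt/checkpoints/events")  # placeholder path
      .start()
)
```

The repartition count effectively plays the role of numPartitions on the write path: each DataFrame partition opens its own JDBC connection, so set it to what the database can comfortably handle in parallel.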
Intermediate Delta Layer (Optional but Recommended):
- Consider writing streaming data to a Delta Lake staging table first.
- From there, use a separate batch job to write to Hyperscale; this improves fault tolerance and scalability, and it simplifies retries if the JDBC sink experiences issues.
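For example, the streaming side could land data in a Delta table, with the JDBC write handled by a separately scheduled job. Table names and checkpoint paths below are placeholders, and this is an alternative to the foreachBatch sink shown earlier:

```python
# Sketch: decouple ingestion from the JDBC sink by staging in Delta first.
# Table name and checkpoint path are placeholders.
staging_query = (
    df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/mnt/checkpoints/events_staging")  # placeholder
      .toTable("bronze.events_staging")                                 # placeholder table
)

# A separate scheduled batch job then reads bronze.events_staging and writes
# to Hyperscale with JDBC settings like those above, retrying independently
# of the streaming ingestion if the sink has issues.
```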
Maintain Azure SQL Hyperscale:
- Periodic index/statistics updates and partition maintenance on Hyperscale help sustain write performance during high ingestion periods.
Conclusion:
- Yes, Kafka partition size significantly affects parallelism and performance.
- Use smaller, more numerous partitions to scale effectively in Databricks.
- Tune ingestion with maxOffsetsPerTrigger, foreachBatch, and JDBC parameters.
- Consider decoupling ingestion using Delta staging before writing to Azure SQL.
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Thank you.