Thanks for outlining your architecture and questions; you are addressing a very important aspect of scaling streaming pipelines for high-throughput use cases.
Based on your scenario (Databricks streaming from Kafka to Azure SQL Hyperscale), here are responses and guidance for each of your queries:
Does Kafka partition size impact execution in Databricks?
Yes, Kafka partition size directly impacts task execution and parallelism in Spark Structured Streaming. In Databricks:
By default, each Kafka partition maps to one Spark task per micro-batch.
If a single partition holds 5+ TB of data, every micro-batch funnels that partition's records through a single task, which can result in:
- Task skew, where one task takes significantly longer than others
- Executor memory pressure or potential OOM (Out of Memory) issues
- Slow checkpointing and longer batch intervals
This can negatively impact streaming stability and throughput.
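As an illustration, the Kafka source in Structured Streaming exposes a minPartitions option that asks Spark to split large Kafka partitions into more Spark input partitions, which can relieve skew from an oversized partition. A minimal sketch, assuming a Databricks notebook where spark is predefined; the broker addresses and topic name are placeholders:

```python
# Minimal sketch: read a Kafka topic and ask Spark to split the Kafka
# partitions into more Spark partitions per micro-batch to reduce task skew.
# Broker addresses and topic name are placeholders.
df = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder
        .option("subscribe", "orders")                                   # placeholder topic
        .option("startingOffsets", "latest")
        # Create at least this many Spark input partitions per micro-batch,
        # even if the topic itself has fewer Kafka partitions.
        .option("minPartitions", "128")
        .load()
)
```

This only helps Spark-side parallelism; the Kafka-side sizing guidance below still applies.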
Is there any documented best practice for Kafka partition size?
While there's no strict maximum size per Kafka partition published by Microsoft or Databricks, community best practices (including those from Databricks, Confluent, and Kafka maintainers) suggest:
- Keep Kafka partition sizes small (typically ~1 GB to a few GB per partition) to enable better parallelism and fault isolation.
- Avoid letting individual partitions grow into the multi-terabyte range.
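To see how these two guidelines interact at your scale, here is a rough back-of-the-envelope check; the retention and partition-count values are illustrative assumptions, not recommendations:

```python
# Rough sizing check: average data volume per Kafka partition for one topic.
# The retention and partition-count values are illustrative assumptions.
daily_ingest_gb = 10_000      # upper end of 5-10 TB/day per topic
retention_days = 3            # assumed topic retention
partitions = 200              # assumed partition count

gb_per_partition = daily_ingest_gb * retention_days / partitions
print(f"~{gb_per_partition:.0f} GB per partition")   # ~150 GB per partition
```

Even at 200 partitions, a few days of retention already puts each partition well beyond the low-GB guideline, so partition count and retention policy need to be tuned together.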
Recommended partitioning and tuning strategies
To support your ingestion volumes (5–10 TB/day/topic), here are some recommended strategies:
Increase Kafka Partition Count
- Ensure topics are sufficiently partitioned (ideally in the range of 50–200+ partitions for high-volume topics).
- This enables Databricks to process data in parallel across multiple tasks and executors.
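As a sketch, the partition count of an existing topic can be raised with the confluent-kafka Python AdminClient (an assumption on my part; kafka-topics.sh or your managed Kafka service's tooling works just as well). Broker and topic values are placeholders:

```python
# Sketch: grow an existing topic to 200 partitions in total.
# Note: partition counts can only be increased, never decreased.
from confluent_kafka.admin import AdminClient, NewPartitions

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # placeholder brokers

futures = admin.create_partitions([NewPartitions("orders", 200)])  # placeholder topic
for topic, future in futures.items():
    future.result()  # raises if the broker rejects the request
```

Keep in mind that adding partitions changes how keyed messages map to partitions, so plan the change before ordering-sensitive consumers come to depend on a fixed key-to-partition assignment.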
Use maxOffsetsPerTrigger in Structured Streaming
- Helps limit how much data Spark reads from Kafka per micro-batch, preventing large spikes in memory/processing time.
- Tune this based on micro-batch latency targets and SQL Hyperscale write throughput.
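A sketch of how this looks on the read side; the cap of 5,000,000 records per micro-batch is an arbitrary starting point to tune against your latency targets and sink throughput, and the broker/topic values are placeholders:

```python
# Cap how many Kafka records Spark pulls per micro-batch so downstream writes
# to SQL Hyperscale see a steady, bounded load. Tune the cap empirically.
df = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder
        .option("subscribe", "orders")                                   # placeholder topic
        .option("maxOffsetsPerTrigger", "5000000")  # total records per micro-batch, across all partitions
        .load()
)
```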
Implement Size-Based Compaction or Retention in Kafka
- Helps prevent partitions from growing excessively over time.
- Can reduce noise and stale records when working with CDC-type messages.
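For illustration, topic retention and compaction settings can be adjusted with the confluent-kafka AdminClient (again an assumption; kafka-configs.sh or your platform's tooling is equivalent). The values below are illustrative, not recommendations:

```python
# Sketch: cap on-disk size per partition and enable compaction for CDC-style keys.
# Topic name, broker, and limits are placeholders; align them with replay needs.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "broker1:9092"})  # placeholder brokers

resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    "orders",  # placeholder topic
    set_config={
        "retention.bytes": str(50 * 1024**3),   # ~50 GB per partition (retention.bytes is per partition)
        "cleanup.policy": "compact,delete",     # compact by key, then delete by retention
    },
)
futures = admin.alter_configs([resource])
for res, future in futures.items():
    future.result()  # raises if the broker rejects the change
```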
Additional recommendations for Azure SQL Hyperscale as Sink
JDBC Sink Optimization:
- Use foreachBatch() to gain fine-grained control over how data is written (a sketch follows this list).
- Partition the DataFrame by a meaningful key (e.g., date, ID) before writing to enable parallel inserts.
- Tune JDBC options such as:
  - batchsize
  - numPartitions
  - Connection pool parameters
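A minimal sketch of the foreachBatch pattern, assuming df is the streaming DataFrame read from Kafka as in the earlier snippets and that the Microsoft SQL Server JDBC driver is available on the cluster; connection details, table and column names, and tuning values are placeholders:

```python
# Sketch: write each micro-batch to Azure SQL Hyperscale via JDBC.
# Connection details, table/column names, and tuning values are placeholders.
jdbc_url = (
    "jdbc:sqlserver://<server>.database.windows.net:1433;"
    "database=<db>;encrypt=true"
)
jdbc_props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "batchsize": "10000",   # rows per JDBC batch insert; tune empirically
}

def write_to_hyperscale(batch_df, batch_id):
    # Repartition by a meaningful key so inserts run over parallel connections,
    # but keep the partition count aligned with what Hyperscale can absorb.
    (batch_df
        .repartition(16, "event_date")          # hypothetical key column
        .write
        .mode("append")
        .jdbc(jdbc_url, "dbo.events_staging", properties=jdbc_props))  # placeholder table

query = (
    df.writeStream
      .foreachBatch(write_to_hyperscale)
      .option("checkpointLocation", "/mnt/checkpoints/events")  # placeholder path
      .start()
)
```

The repartition count effectively plays the role of numPartitions on the write path: each DataFrame partition opens its own JDBC connection, so set it to what the database can comfortably handle in parallel.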
Intermediate Delta Layer (Optional but Recommended):
- Consider writing streaming data to a Delta Lake staging table first.
- From there, use a separate batch job to write to Hyperscale; this improves fault tolerance and scalability, and it simplifies retries if the JDBC sink experiences issues.
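For example, the streaming side could land data in a Delta table, with the JDBC write handled by a separately scheduled job. Table names and checkpoint paths below are placeholders, and this is an alternative to the foreachBatch sink shown earlier:

```python
# Sketch: decouple ingestion from the JDBC sink by staging in Delta first.
# Table name and checkpoint path are placeholders.
staging_query = (
    df.writeStream
      .format("delta")
      .outputMode("append")
      .option("checkpointLocation", "/mnt/checkpoints/events_staging")  # placeholder
      .toTable("bronze.events_staging")                                 # placeholder table
)

# A separate scheduled batch job then reads bronze.events_staging and writes
# to Hyperscale with JDBC settings like those above, retrying independently
# of the streaming ingestion if the sink has issues.
```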
Maintain Azure SQL Hyperscale:
- Periodic index/statistics updates and partition maintenance on Hyperscale help sustain write performance during high ingestion periods.
Conclusion:
- Yes, Kafka partition size significantly affects parallelism and performance.
- Use smaller, more numerous partitions to scale effectively in Databricks.
- Tune ingestion with maxOffsetsPerTrigger, foreachBatch, and JDBC parameters.
- Consider decoupling ingestion using Delta staging before writing to Azure SQL.
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Thank you.