Impact of Kafka Partition Size on Databricks Streaming Performance When Writing to Azure SQL Hyperscale

Janice Chi 100 Reputation points
2025-06-10T16:08:12.71+00:00

In our project, we are using Databricks (not ADF) for both catch-up and real-time CDC ingestion from Kafka topics, writing the output directly to Azure SQL Hyperscale via JDBC. Some of our source Kafka topics (originating from DB2 CDC) may have very large partitions, ranging from 2 TB to 10 TB per partition for high-volume tables.

We want to confirm:

Does the partition size of a Kafka topic impact the execution parallelism or task performance in Databricks? For example, if one Kafka partition holds 5+ TB of data, will that cause Spark to process it as a single task, leading to skew, executor memory pressure, or slow checkpointing?

Is there any guidance or documented best practice from Microsoft or Databricks on the recommended max size per Kafka partition, especially when writing directly to Azure SQL Hyperscale using Databricks?

Would Microsoft recommend any partition strategy (e.g., higher Kafka partition count per topic, size-based compaction, maxOffsetsPerTrigger tuning) to avoid bottlenecks in DBR-to-Hyperscale CDC pipelines?

We are looking for confirmation and optimization suggestions for high-throughput scenarios (5–10 TB daily ingestion per topic) using Databricks structured streaming, Kafka, and Hyperscale as the sink.

Azure Databricks

1 answer

  1. Chandra Boorla 13,790 Reputation points Microsoft External Staff Moderator
    2025-06-10T17:53:24.9966667+00:00

    @Janice Chi

    Thanks for outlining your architecture and questions; you're addressing a very important aspect of scaling streaming pipelines for high-throughput use cases.

    Based on your scenario (Databricks streaming from Kafka to Azure SQL Hyperscale), here is guidance for each of your queries:

    Does Kafka partition size impact execution in Databricks?

    Yes, Kafka partition size directly impacts task execution and parallelism in Spark Structured Streaming. In Databricks:

    Each Kafka partition maps to one Spark task per micro-batch.

    If a single Kafka partition holds 5+ TB of data, Spark will attempt to read that partition's offset range as a single task in each micro-batch, which can result in:

    • Task skew, where one task takes significantly longer than others
    • Executor memory pressure or potential OOM (Out of Memory) issues
    • Slow checkpointing and longer batch intervals

    This can negatively impact streaming stability and throughput.
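
    As a rough illustration, the minimal read sketch below shows the default one-task-per-Kafka-partition behavior and the Kafka source's minPartitions option, which asks Spark to split a large partition's offset range across more tasks within each micro-batch (it does not change the topic itself). The broker addresses, topic name, and numeric values are placeholders, not recommendations.

    ```python
    # Minimal sketch, assuming the ambient `spark` session of a Databricks notebook.
    # Brokers, topic, and numbers below are placeholders.
    raw_stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9093,broker2:9093")  # placeholder brokers
        .option("subscribe", "db2_cdc_orders")                           # placeholder topic
        .option("startingOffsets", "earliest")
        .option("minPartitions", 200)  # ask Spark for at least 200 read tasks per micro-batch
        .load()
    )
    ```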

    Is there any documented best practice for Kafka partition size?

    While there's no strict maximum size per Kafka partition published by Microsoft or Databricks, community best practices (including those from Databricks, Confluent, and Kafka maintainers) suggest:

    • Keeping Kafka partition sizes small (typically ~1 GB to a few GBs per partition) to enable better parallelism and fault isolation.
    • Avoiding letting individual partitions grow into the multi-terabyte range.

    Recommended partitioning and tuning strategies

    To support your ingestion volumes (5–10 TB/day/topic), here are some recommended strategies:

    Increase Kafka Partition Count

    • Ensure topics are sufficiently partitioned (ideally in the range of 50–200+ partitions for high-volume topics).
    • This enables Databricks to process data in parallel across multiple tasks and executors.

    Use maxOffsetsPerTrigger in Structured Streaming

    • Helps limit how much data Spark reads from Kafka per micro-batch, preventing large spikes in memory/processing time.
    • Tune this based on micro-batch latency targets and SQL Hyperscale write throughput, as shown in the sketch below.
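
    For example, a brief sketch of capping the per-trigger read (the cap value is illustrative and should be tuned; brokers and topic are placeholders):

    ```python
    # Sketch: bound how many Kafka offsets each micro-batch consumes.
    bounded_stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9093,broker2:9093")  # placeholder brokers
        .option("subscribe", "db2_cdc_orders")                           # placeholder topic
        .option("maxOffsetsPerTrigger", 5000000)  # illustrative cap on records per micro-batch
        .load()
    )
    ```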

    Implement Size-Based Compaction or Retention in Kafka

    • Helps prevent partitions from growing excessively over time.
    • Can reduce noise and stale records when working with CDC-type messages.

    Additional recommendations for Azure SQL Hyperscale as Sink

    JDBC Sink Optimization:

    • Use foreachBatch() to gain fine-grained control over how data is written; a minimal sketch follows this list.

    • Repartition the DataFrame by a meaningful key (e.g., date, ID) before writing to enable parallel inserts.

    • Tune JDBC options such as:

      • batchsize
      • numPartitions
      • connection pool parameters
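
    Below is a minimal sketch of this pattern, assuming the ambient spark session and dbutils of a Databricks notebook; the JDBC URL, secret scope, topic, table, and column names are placeholders, and the payload is kept as a raw string purely for brevity.

    ```python
    # Sketch of a foreachBatch JDBC sink; all names and numbers are placeholders to tune.
    jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"  # placeholder

    parsed_stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9093")   # placeholder brokers
        .option("subscribe", "db2_cdc_orders")                # placeholder topic
        .load()
        .selectExpr("CAST(key AS STRING) AS cdc_key", "CAST(value AS STRING) AS cdc_payload")
    )

    def write_to_hyperscale(batch_df, batch_id):
        # Repartition by a key so several tasks insert in parallel; tune the partition
        # count (and the JDBC batchsize) to what Hyperscale can absorb.
        (
            batch_df.repartition(16, "cdc_key")
            .write.format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", "dbo.cdc_staging")                                # placeholder table
            .option("user", dbutils.secrets.get("kv-scope", "sql-user"))         # assumed secret scope
            .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
            .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
            .option("batchsize", 10000)  # rows per JDBC batch insert
            .mode("append")
            .save()
        )

    (
        parsed_stream.writeStream
        .foreachBatch(write_to_hyperscale)
        .option("checkpointLocation", "/mnt/checkpoints/cdc_to_hyperscale")  # placeholder path
        .start()
    )
    ```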

    Intermediate Delta Layer (Optional but Recommended):

    • Consider writing streaming data to a Delta Lake staging table first.
    • From there, use a separate batch job to write to Hyperscale; this improves fault tolerance and scalability, and simplifies retries if the JDBC sink experiences issues (a brief two-stage sketch follows).
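
    The sketch below illustrates the two-stage pattern, reusing the parsed_stream DataFrame and jdbc_url placeholder from the previous sketch; the Delta table, checkpoint path, and target table names are hypothetical.

    ```python
    # Stage 1: stream the parsed Kafka data into a Delta staging table.
    (
        parsed_stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/cdc_staging")  # placeholder path
        .outputMode("append")
        .toTable("cdc_staging.orders")                                 # placeholder Delta table
    )

    # Stage 2: a separate, scheduled batch job drains the staging table into Hyperscale,
    # keeping JDBC retries isolated from streaming ingestion.
    staged = spark.read.table("cdc_staging.orders")
    (
        staged.write.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "dbo.orders_cdc")                             # placeholder target table
        .option("user", dbutils.secrets.get("kv-scope", "sql-user"))     # assumed secret scope
        .option("password", dbutils.secrets.get("kv-scope", "sql-password"))
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .option("batchsize", 10000)
        .mode("append")
        .save()
    )
    ```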

    Maintain Azure SQL Hyperscale:

    • Periodic index/statistics updates and partition maintenance on Hyperscale help sustain write performance during high ingestion periods.

    Conclusion:

    • Yes, Kafka partition size significantly affects parallelism and performance.
    • Use smaller, more numerous partitions to scale effectively in Databricks.
    • Tune ingestion with maxOffsetsPerTrigger, foreachBatch, and JDBC parameters.
    • Consider decoupling ingestion with a Delta staging layer before writing to Azure SQL.

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

