Hi @Janice Chi
Your scenario is common in high-throughput, micro-batch-based streaming pipelines using Databricks Structured Streaming with Kafka. Below is a structured, experience-backed answer addressing each concern, with field-proven strategies to prevent micro-batch data loss or duplication and ensure auditability.
Is micro-batch reprocessing common in high-throughput pipelines?
Yes. This behavior is by design in Structured Streaming’s at-least-once model:
- Offsets are committed after the micro-batch is successfully processed and checkpointed.
- If the pipeline crashes after processing but before checkpointing, Spark will reprocess the same offsets on restart.
Reprocessing frequency increases with:
- High partition counts (hundreds, as in your case).
- High throughput (25K events/sec).
- Sink performance bottlenecks (Azure SQL latency, network issues).
Best practices to proactively avoid loss or duplication
Use idempotent writes with MERGE (upsert) logic
For Azure SQL sinks:
- Avoid plain append mode, which produces duplicate rows whenever a micro-batch is replayed.
- Use foreachBatch with a MERGE INTO statement or a stored procedure to perform idempotent upserts:
```python
def upsert_to_sql(microbatch_df, batch_id):
    # Expose the micro-batch to SQL as a temp view.
    microbatch_df.createOrReplaceTempView("updates")
    # Note: MERGE INTO via spark.sql targets a Delta table. For a direct
    # Azure SQL upsert, call a stored procedure over JDBC here instead.
    upsert_query = """
        MERGE INTO target_table AS t
        USING updates AS s
        ON t.primary_key = s.primary_key
        WHEN MATCHED THEN UPDATE SET ...   -- fill in your column assignments
        WHEN NOT MATCHED THEN INSERT ...   -- fill in your column list/values
    """
    # Use the batch DataFrame's own session (df._spark is not a public API).
    microbatch_df.sparkSession.sql(upsert_query)
```
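To attach this to the stream, pass the function to foreachBatch. A minimal wiring sketch, where `df` is the streaming DataFrame from your Kafka reader and the checkpoint path and trigger interval are illustrative:

```python
(df.writeStream
    .foreachBatch(upsert_to_sql)  # runs the upsert once per micro-batch
    .option("checkpointLocation", "/mnt/checkpoints/upsert")  # illustrative path
    .trigger(processingTime="30 seconds")
    .start())
```

The checkpoint is what ties each batch_id to a Kafka offset range, so a replayed batch arrives with the same batch_id and the MERGE keeps the write idempotent.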
In-stream deduplication with watermarking
If each event carries a unique ID and an event timestamp:
```python
# Keep event_time in the keys so the watermark can evict old dedup state.
deduped_df = (df.withWatermark("event_time", "10 minutes")
                .dropDuplicates(["event_id", "event_time"]))
```
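Without the event-time column in the key list, dropDuplicates keeps every key in state indefinitely; including it (as above) lets Spark purge state older than the watermark. On Spark 3.5+, dropDuplicatesWithinWatermark is a purpose-built alternative.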
Tune micro-batch size
- Use maxOffsetsPerTrigger to control the Kafka pull rate, as in the sketch after this list.
- Ensure each batch is small enough to process within 10–30 seconds.
- Note that spark.streaming.kafka.maxRatePerPartition applies only to the legacy DStream API; for Structured Streaming, maxOffsetsPerTrigger is the equivalent control.
- Set startingOffsets = latest (honored only on the first start; once a checkpoint exists, the query resumes from the checkpointed offsets).
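A minimal reader sketch, assuming hypothetical broker and topic names; the 50,000-record cap (roughly 2 seconds of data at 25K events/sec) is illustrative:

```python
raw_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "events")                      # hypothetical topic
    .option("startingOffsets", "latest")                # first start only
    .option("maxOffsetsPerTrigger", 50000)              # cap per micro-batch
    .load())
```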
Isolate write failures
- Avoid .writeStream.format("jdbc"): Structured Streaming has no built-in JDBC sink, and JDBC writes give you no transactional replay guarantees.
- Prefer staging to Delta, then bulk-loading into Azure SQL with Spark or ADF if you need exactly-once semantics; a minimal sketch follows.
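A staging sketch, where the table name `staging_events` and the checkpoint path are illustrative:

```python
# Delta commits each micro-batch atomically, so a crash-and-replay
# cannot leave partial data in the staging table.
(deduped_df.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/staging")  # illustrative path
    .toTable("staging_events"))                                # illustrative table
```

From there, a scheduled Spark job or ADF copy activity can bulk-load into Azure SQL under a single transaction.
I hope this information helps. Please do let us know if you have any further queries.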
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Thank you.