Retry and Failure Handling Strategy for CDC Merge Pipeline from Kafka to Databricks and Hyperscale

Janice Chi 100 Reputation points
2025-06-14T18:28:28.0433333+00:00

In our CDC ingestion architecture, we are processing incremental changes (3,000-30,000 events/sec) across 800 Kafka topics for 800 tables from IBM DB2 (via IBM InfoSphere CDC), with the following two stages:

Kafka to Databricks Silver Layer: We use a MERGE-based approach to apply CDC records (I/U/D) onto the Silver layer Delta tables, based on a previously loaded historical snapshot.

Databricks Silver to Azure SQL Hyperscale: Post Silver layer merge, we ingest impacted records into Azure SQL Hyperscale using staging + merge logic.

We maintain control tables that capture metadata like topic, partition, start_offset, end_offset, run_id, status, and error_msg.
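
For reference, our control table currently looks roughly like this (shown only to make the question concrete; the schema and names are from our own design):

```python
# Rough sketch of our control table -- one row per processed offset range.
# (Assumes the cdc_control schema already exists.)
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
CREATE TABLE IF NOT EXISTS cdc_control.offset_ranges (
    topic         STRING,
    `partition`   INT,
    start_offset  BIGINT,
    end_offset    BIGINT,
    run_id        STRING,
    status        STRING,   -- e.g. PENDING / RUNNING / SUCCEEDED / FAILED
    error_msg     STRING
)
USING DELTA
""")
```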

Is any other metadata required?

Our plan is to merge CDC data with historical data at the offset level, and we also need to detect when all CDC has been applied at the offset level (rather than per micro-batch), since the next stage will run only once CDC is complete.

Our questions are:

1. What are the most common reasons a MERGE INTO operation from Kafka to Silver Delta tables may fail, especially for large tables or evolving CDC schemas? Are there recommended patterns to make the retry logic fully idempotent at the offset level?

2. When a failure occurs during Silver layer ingestion (e.g., schema conflict, cluster timeout, partial write), can we reliably reprocess the same Kafka topic-partition-offset range without data corruption or duplication? Is there a way to mark offset-level ingestion as atomic?

3. For the next stage (Silver to Hyperscale), what are common failure points (e.g., JDBC write timeout, merge constraint violation), and what is Microsoft's recommendation for implementing safe retries when the batch load fails for a specific run_id?

4. Is there a recommended control table pattern or status model to differentiate transient, logical, or schema-related failures, so that we can design automated retry vs. manual intervention pipelines accordingly?


1 answer

  1. Chiugo Okpala 1,740 Reputation points MVP
    2025-06-15T12:19:21.0733333+00:00

    @Janice Chi welcome to the Microsoft Q&A community.

    Your CDC ingestion pipeline is quite robust, and ensuring reliable retries and failure handling is crucial for maintaining data integrity. Here are some insights based on best practices:

    1. Common Reasons for MERGE INTO Failures in Kafka to Silver Delta Tables

    • Schema Evolution Issues: If new columns are introduced or data types change, the MERGE operation may fail.

    • Concurrency Conflicts: High ingestion rates can lead to concurrent updates, causing deadlocks or race conditions.

    • Cluster Resource Constraints: Large tables may cause memory or compute exhaustion, leading to timeouts.

    • Data Skew: Uneven distribution of data across partitions can lead to performance bottlenecks.

    • Idempotent Retry Patterns: Using offset-based tracking ensures that retries do not duplicate data; implementing checkpointing at the offset level can help (see the sketch after this list).
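
    A minimal sketch of the offset-level idempotent pattern mentioned above, assuming a control table like the one in the question (the control table, topic, Silver table, and CDC payload schema shown here are illustrative, not a fixed API):

```python
# Sketch only: offset-level idempotent merge from Kafka into a Silver Delta table.
# Table/column names (cdc_control.offset_ranges, silver.customer) are illustrative.
import json

from delta.tables import DeltaTable
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

def already_processed(topic, partition, start_offset, end_offset):
    """Skip ranges the control table already marks as SUCCEEDED, so retries become no-ops."""
    return (spark.table("cdc_control.offset_ranges")
            .filter((F.col("topic") == topic) &
                    (F.col("partition") == partition) &
                    (F.col("start_offset") == start_offset) &
                    (F.col("end_offset") == end_offset) &
                    (F.col("status") == "SUCCEEDED"))
            .count() > 0)

def merge_offset_range(topic, partition, start_offset, end_offset):
    if already_processed(topic, partition, start_offset, end_offset):
        return  # idempotent: re-running an already committed range does nothing

    # Bounded batch read of exactly this offset range (endingOffsets is exclusive).
    raw = (spark.read.format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
           .option("subscribe", topic)
           .option("startingOffsets", json.dumps({topic: {str(partition): start_offset}}))
           .option("endingOffsets", json.dumps({topic: {str(partition): end_offset}}))
           .load())

    parsed = (raw
              .selectExpr("CAST(value AS STRING) AS value", "offset")
              .select(F.from_json("value", "key_id BIGINT, op STRING, payload STRING, change_ts TIMESTAMP").alias("r"), "offset")
              .select("r.*", "offset"))

    # Collapse multiple changes to the same key: keep only the latest event in the range.
    w = Window.partitionBy("key_id").orderBy(F.col("offset").desc())
    latest = parsed.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

    # MERGE on a deduplicated source is deterministic, so replaying the range is safe.
    (DeltaTable.forName(spark, "silver.customer").alias("t")
        .merge(latest.alias("s"), "t.key_id = s.key_id")
        .whenMatchedDelete(condition="s.op = 'D'")
        .whenMatchedUpdateAll(condition="s.op IN ('I', 'U')")
        .whenNotMatchedInsertAll(condition="s.op IN ('I', 'U')")
        .execute())

    # Finally, mark this range SUCCEEDED in the control table (omitted for brevity).
```

    Because the control table is checked first and the source is reduced to one row per key, re-running the same topic-partition-offset range after a failure is either a no-op or a deterministic re-merge.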

    2. Reliable Reprocessing Without Data Corruption

    • Atomicity at Offset Level: Delta Lake's transaction log makes each MERGE an atomic commit, so if an offset range is applied in a single MERGE, a failure leaves the table unchanged and the range can simply be replayed.

    • Partitioned Processing: Reprocessing should be done at the Kafka topic-partition-offset level to avoid duplication.

    • Schema Validation Before Merge: Running schema checks before ingestion can prevent failures due to unexpected changes (see the sketch after this list).
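
    As a small illustration of the schema check, one possible approach (table names are placeholders) is to compare the incoming batch's schema with the target Delta table before attempting the MERGE; because a single MERGE is one atomic Delta commit, failing fast here leaves the Silver table untouched:

```python
# Sketch: validate an incoming CDC batch's schema against the Silver table before merging.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def validate_schema_or_fail(incoming_df, target_table: str):
    """Raise before the MERGE if the batch would hit a schema conflict."""
    target_fields = {f.name: f.dataType for f in spark.table(target_table).schema.fields}
    incoming_fields = {f.name: f.dataType for f in incoming_df.schema.fields}

    new_columns = set(incoming_fields) - set(target_fields)
    type_changes = {c for c in incoming_fields
                    if c in target_fields and incoming_fields[c] != target_fields[c]}

    if new_columns or type_changes:
        # Record this as a schema-related failure in the control table instead of merging,
        # so the offset range can be retried once the schema is reconciled.
        raise ValueError(
            f"Schema drift vs {target_table}: new columns {sorted(new_columns)}, "
            f"type changes {sorted(type_changes)}")
```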

    3. Failure Points in Silver to Hyperscale Stage

    • JDBC Write Timeouts: Large batch writes may exceed timeout limits.

    • Merge Constraint Violations: Primary key conflicts or missing dependencies can cause failures.

    • Safe Retry Strategies (see the sketch after this list):

        • Implement batch-level retries with exponential backoff.

        • Use staging tables to validate data before merging into Hyperscale.

        • Maintain run_id-based tracking to ensure failed batches can be retried safely.
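
    A minimal sketch of the staging-plus-retry idea for one run_id, using Spark's JDBC writer with exponential backoff (the connection string, credentials, table names, and the run_id column on the Silver table are placeholders):

```python
# Sketch: load the impacted Silver rows for one run_id into a Hyperscale staging table,
# retrying transient JDBC failures with exponential backoff.
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

JDBC_URL = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"  # placeholder

def write_staging_with_retries(run_id: str, max_attempts: int = 4):
    batch = spark.table("silver.customer").filter(f"run_id = '{run_id}'")  # illustrative filter

    for attempt in range(1, max_attempts + 1):
        try:
            (batch.write.format("jdbc")
                 .option("url", JDBC_URL)
                 .option("dbtable", "stg.customer_changes")
                 .option("user", "<user>")          # placeholder; prefer a secret scope
                 .option("password", "<password>")  # placeholder
                 .option("batchsize", 10000)
                 .mode("overwrite")                 # staging is rewritten per run_id, so retries are safe
                 .save())
            return
        except Exception as exc:
            if attempt == max_attempts:
                raise  # mark this run_id FAILED in the control table and stop
            print(f"Attempt {attempt} failed ({exc}); retrying...")
            time.sleep(2 ** attempt)  # exponential backoff: 2s, 4s, 8s, ...
```

    Once staging succeeds, the staging-to-target MERGE can run in a single Hyperscale transaction (for example via a stored procedure keyed by run_id), so re-executing it for a retried batch stays idempotent.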
    4. Recommended Control Table Patterns (see the sketch after this list)

    • Status Model for Failures:

        • **Transient Failures** (e.g., network issues) → Auto-retry.

        • **Logical Failures** (e.g., data validation errors) → Flag for manual review.

        • **Schema-related Failures** → Require intervention before retry.

    • Metadata Enhancements:

        • Include **error category** (e.g., timeout, schema mismatch).

        • Track **retry attempts** to prevent infinite loops.

        • Maintain **audit logs** for debugging.
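
    To make that status model concrete, here is one possible classification helper (the category names, the extra control-table columns they imply such as error_category and retry_count, and the retry threshold are all assumptions, not a Microsoft standard):

```python
# Sketch: map a failure recorded in the control table to the pipeline's next action,
# so an orchestrator can separate automated retries from manual intervention.
TRANSIENT = {"NETWORK", "CLUSTER_TIMEOUT", "JDBC_TIMEOUT"}   # safe to auto-retry with backoff
LOGICAL = {"VALIDATION_FAILED", "CONSTRAINT_VIOLATION"}      # flag for manual review
SCHEMA = {"SCHEMA_MISMATCH"}                                 # hold until the schema is reconciled

MAX_RETRIES = 3

def next_action(error_category: str, retry_count: int) -> str:
    """Decide what to do with a FAILED control-table row."""
    if error_category in TRANSIENT and retry_count < MAX_RETRIES:
        return "AUTO_RETRY"
    if error_category in SCHEMA:
        return "HOLD_FOR_SCHEMA_FIX"
    return "MANUAL_REVIEW"  # logical failures, unknown categories, or exhausted retries
```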
                  
    

    For more details, you can check this resource on retry and failure handling strategies. Let me know if you need further clarification.

    I hope this helps. Let me know if you have any further questions or need additional assistance.

    Also, if this answers your query, please click "Upvote" and "Accept the answer", as it might be beneficial to other community members reading this thread.


