CDC Kafka reconciliation strategy

Janice Chi 100 Reputation points
2025-06-12T13:38:05.87+00:00

We are implementing a CDC-based ingestion using IBM InfoSphere CDC pushing data to Kafka. Downstream, the data is consumed in Databricks, processed, and written to Azure SQL Hyperscale. We use run-wise ingestion into the Bronze layer and perform a merge-based load into Hyperscale. We want to implement reliable row-count and hash-based reconciliation during CDC.


Questions

Kafka Offset Mutability and Recon Validity: Once a CDC event is published to Kafka (e.g., at 10 a.m.), is it technically possible that the same offsets within a Kafka partition (say, offsets 1–100) later yield different content due to downstream compaction or replay changes from the source CDC tool?

Recon Timing and Consistency: If we process offset range 1–100 into Hyperscale now and then run reconciliation 6 hours later, is it safe to expect that offsets 1–100 still refer to the exact same records in Kafka? Or is there a risk that the message content has changed (e.g., an update now appears as a delete + insert)?

Source of Truth for Reconciliation: Given that Kafka offsets may evolve or change content over time (due to compaction, retention, or reordering in CDC tools), is it good practice to reconcile directly between Kafka and Hyperscale? If not Kafka, what is the recommended authoritative source for recon in CDC setups?

Recon Granularity and Snapshot Timing: In our project we plan to reconcile at run level (i.e., per ingestion window). What is the best practice for freezing the expected record snapshot (from Kafka or Delta Bronze) for reconciliation when the data in Kafka might evolve?

Dependency Between Kafka Partition and DB Partition: During CDC processing, we receive records by Kafka offset range and write to Delta Bronze partitioned by logical keys (such as client ID and year_month). For merge and reconciliation, we infer DB partitions from the Kafka message payload. Is this dependency mapping (Kafka → logical column → DB partition) sufficient and accurate?

Retention in Bronze Layer Until Recon Completes: If we do not trust Kafka for recon, should we retain the merged Delta Bronze data (used as input for the Hyperscale write) until reconciliation with Hyperscale is complete? If yes, what retention policy or pattern should we use to manage that layer?

Merge Timing and Recon Window Coordination: Since we perform MERGE INTO operations using CDC Delta batches, should reconciliation be performed only after the MERGE is fully complete, or is it recommended to validate intermediate stages as well (Bronze to Silver, then Silver to Hyperscale)?

Azure Databricks

1 answer

  1. J N S S Kasyap 3,300 Reputation points Microsoft External Staff Moderator
    2025-06-12T15:05:43.8633333+00:00

    Hi @Janice Chi

    Can the same Kafka offset yield different content over time due to compaction or CDC tool behavior?

    Kafka offsets themselves are immutable, but InfoSphere CDC can replay events differently (e.g., turning an update into a delete + insert). Also, if log compaction is enabled, older messages may be discarded or merged. Never rely on Kafka offsets alone for reconciliation; persist messages into the Delta Bronze layer immediately.
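
    For illustration, a minimal Structured Streaming sketch of that landing step (broker address, topic name, and paths below are placeholders, not your actual configuration): every event is appended to Bronze together with its Kafka coordinates, so recon never has to re-read the topic later.

    ```python
    # Sketch (broker, topic, and paths are placeholders): land every CDC event in
    # Delta Bronze as soon as it arrives, keeping the Kafka coordinates.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()   # provided automatically on Databricks

    bronze_stream = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "cdc.customer.events")
        .option("startingOffsets", "earliest")
        .load()
        # Keep the raw payload plus the Kafka coordinates; these columns become
        # the immutable evidence used for recon instead of re-reading Kafka.
        .select(
            F.col("key").cast("string").alias("kafka_key"),
            F.col("value").cast("string").alias("payload"),
            "topic", "partition", "offset", "timestamp",
            F.current_timestamp().alias("ingested_at"),
        )
    )

    (bronze_stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/bronze/_checkpoints/cdc_customer")
        .outputMode("append")
        .start("/mnt/bronze/cdc_customer"))
    ```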

    If we ingest Kafka offset range 1–100 now, is it safe to reconcile 6 hours later using those offsets?

    No. Due to Kafka's retention settings or CDC tool behavior, those offsets may be unavailable or semantically different (e.g., re-processed with schema changes). Perform reconciliation on Delta snapshots, not on raw Kafka streams.
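
    One way to keep that comparison stable is Delta time travel: record the Bronze table version at ingestion time and have the recon job read that pinned version. A sketch, assuming the hypothetical Bronze path from the previous example:

    ```python
    # Sketch (hypothetical Bronze path): pin the Delta snapshot by version so the
    # recon job hours later compares against exactly what was ingested, not
    # whatever Kafka returns for offsets 1-100 at that time.
    from pyspark.sql import SparkSession, functions as F
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    bronze_path = "/mnt/bronze/cdc_customer"

    # Capture this version at ingestion time (e.g. store it in a control table).
    pinned_version = (DeltaTable.forPath(spark, bronze_path)
                      .history(1).select("version").first()["version"])

    # Six hours later, recon reads the pinned version instead of "latest".
    snapshot = (spark.read.format("delta")
                .option("versionAsOf", pinned_version)
                .load(bronze_path))

    print(snapshot.where(F.col("offset").between(1, 100)).count())
    ```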

    Is Kafka a reliable source for recon between ingestion and target DB?

    No. Kafka is not a system of record; it is transient and subject to retention, replay, and compaction. Use Delta Bronze or Silver as the source of truth. Optionally, leverage control logs or audit tables from InfoSphere CDC.

    https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/delta-lake

    How do we preserve an accurate snapshot for reconciliation if Kafka evolves?

    As soon as the data lands in Bronze, assign it a run ID and store it in a partitioned folder (e.g., run_id=20240612_1000). Include Kafka metadata (offset range, timestamp, topic, etc.). Track runs and store their metadata in a control table for recon and auditability.
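
    A sketch of that pattern, assuming a hypothetical staging path and an ops.recon_control Delta table (both names are placeholders):

    ```python
    # Sketch (hypothetical paths and table names): stamp the current ingestion
    # window with a run_id, keep Bronze partitioned by it, and freeze the
    # expected offsets/row count in a control table while recon is PENDING.
    from datetime import datetime, timezone
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    run_id = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M")   # e.g. 20240612_1000

    # Hypothetical staging location holding just the records of this window.
    window_df = spark.read.format("delta").load("/mnt/landing/cdc_customer")
    stamped = window_df.withColumn("run_id", F.lit(run_id))

    (stamped.write.format("delta")
        .mode("append")
        .partitionBy("run_id")
        .save("/mnt/bronze/cdc_customer"))

    # Control-table entry: this is the frozen "expected" side of the recon.
    control_row = (stamped.agg(
            F.min("offset").alias("offset_start"),
            F.max("offset").alias("offset_end"),
            F.count("*").alias("expected_rows"))
        .withColumn("run_id", F.lit(run_id))
        .withColumn("recon_status", F.lit("PENDING")))

    control_row.write.format("delta").mode("append").saveAsTable("ops.recon_control")
    ```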

    Can we map Kafka events to DB partitions using payload fields?

    Yes, extract logical partition keys (e.g., client_id, year_month) from Kafka payload and write them to Bronze/Silver. Don’t rely on Kafka’s physical partition numbers.
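
    For illustration, a sketch that parses assumed payload fields (client_id, event_ts, op) and partitions Silver by the derived logical keys; the schema and paths are placeholders:

    ```python
    # Sketch (hypothetical payload schema and paths): derive DB partition keys
    # from the CDC payload itself, never from Kafka's physical partition number.
    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()

    payload_schema = T.StructType([                      # assumed CDC payload fields
        T.StructField("client_id", T.StringType()),
        T.StructField("event_ts", T.TimestampType()),
        T.StructField("op", T.StringType()),             # I / U / D from InfoSphere CDC
    ])

    bronze = spark.read.format("delta").load("/mnt/bronze/cdc_customer")

    silver = (bronze
        .withColumn("body", F.from_json("payload", payload_schema))
        .withColumn("client_id", F.col("body.client_id"))
        .withColumn("year_month", F.date_format("body.event_ts", "yyyyMM"))
        .drop("body"))

    # These logical keys drive the Hyperscale merge and the recon grouping.
    (silver.write.format("delta")
        .mode("append")
        .partitionBy("client_id", "year_month")
        .save("/mnt/silver/cdc_customer"))
    ```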

    Should we keep the Bronze data after ingestion?

    Yes. Retain it until reconciliation with Hyperscale is successful. If recon fails or Kafka is purged, Bronze acts as the rollback point.

    To manage data retention effectively in your CDC pipeline, use a control table to track the recon_status of each ingestion run (e.g., PENDING, IN_PROGRESS, FAILED, SUCCESS). Only when the status is marked as SUCCESS, indicating that reconciliation with the target system is complete, should you automatically delete the corresponding Delta Bronze data. This ensures traceability and provides a rollback point in case of reconciliation failures.
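
    A sketch of such a cleanup job, reusing the hypothetical ops.recon_control table and Bronze path from the earlier examples:

    ```python
    # Sketch (hypothetical control table and Bronze path): delete only the runs
    # whose reconciliation has been marked SUCCESS, then reclaim the files.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    reconciled_runs = [
        row["run_id"]
        for row in (spark.table("ops.recon_control")
                    .where("recon_status = 'SUCCESS'")
                    .select("run_id").distinct().collect())
    ]

    for run_id in reconciled_runs:
        # Until this point the run's Bronze partition remains the rollback point.
        spark.sql(f"""
            DELETE FROM delta.`/mnt/bronze/cdc_customer`
            WHERE run_id = '{run_id}'
        """)

    # Physically remove the deleted files once the retention window has passed.
    spark.sql("VACUUM delta.`/mnt/bronze/cdc_customer` RETAIN 168 HOURS")
    ```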

    Should recon be done after the MERGE INTO from Silver to SQL completes?

    Yes. Final reconciliation should validate the data after the merge into the target, though intermediate-stage validation (Bronze → Silver) is also valuable.

    https://learn.microsoft.com/en-us/azure/databricks/delta/merge
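
    A sketch of the post-merge check, with hypothetical JDBC connection details and column names; it fingerprints both sides with the same function so the row counts and checksums are directly comparable.

    ```python
    # Sketch (hypothetical JDBC details and column names): compare a per-run row
    # count plus an order-independent checksum between Silver and Hyperscale.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    run_id = "20240612_1000"                                      # run under recon

    def fingerprint(df):
        # Order-independent checksum: identical hash logic on both sides.
        hashed = df.withColumn(
            "row_hash", F.crc32(F.concat_ws("|", "client_id", "year_month", "balance")))
        return hashed.agg(F.count("*").alias("rows"),
                          F.sum("row_hash").alias("checksum")).first()

    silver = (spark.read.format("delta").load("/mnt/silver/cdc_customer")
              .where(F.col("run_id") == run_id)
              .select("client_id", "year_month", "balance"))

    # Assumes the target table carries the run_id written during the MERGE.
    target = (spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://myserver.database.windows.net;databaseName=mydb")
        .option("query", f"SELECT client_id, year_month, balance "
                         f"FROM dbo.customer WHERE run_id = '{run_id}'")
        .option("user", "recon_user")
        .option("password", "***")
        .load())

    s, t = fingerprint(silver), fingerprint(target)
    if (s["rows"], s["checksum"]) == (t["rows"], t["checksum"]):
        print(f"run {run_id}: reconciled - safe to mark SUCCESS in the control table")
    else:
        print(f"run {run_id}: mismatch - keep Bronze and investigate before purging")
    ```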

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

