Hi @Janice Chi
Can the same Kafka offset yield different content over time due to compaction or CDC tool behavior?
Kafka offsets themselves are immutable, but InfoSphere CDC can replay events differently (e.g., turning an update into a delete + insert). Also, if log compaction is enabled, older messages may be discarded or merged. Never rely on Kafka offsets alone for reconciliation; persist messages into the Delta Bronze layer immediately.
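For reference, here is a minimal PySpark sketch of landing the raw Kafka stream in Delta Bronze together with its offset metadata. The topic name, broker address, and paths are placeholders for your environment, and it assumes a Databricks notebook where `spark` is already defined.

```python
from pyspark.sql import functions as F

bronze_stream = (
    spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
        .option("subscribe", "cdc.accounts")                  # placeholder topic
        .option("startingOffsets", "earliest")
        .load()
        # Keep the raw payload plus the Kafka metadata needed for audit/recon.
        .select(
            F.col("key").cast("string"),
            F.col("value").cast("string").alias("payload"),
            "topic", "partition", "offset", "timestamp",
        )
        .withColumn("ingest_ts", F.current_timestamp())
)

(
    bronze_stream.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/bronze/_checkpoints/accounts")  # placeholder path
        .outputMode("append")
        .start("/mnt/bronze/accounts")                                       # placeholder path
)
```

Because the offset, partition, and timestamp columns travel with the payload, the Bronze copy stays reconcilable even after Kafka retention or compaction removes the original messages.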
If we ingest Kafka offset range 1–100 now, is it safe to reconcile 6 hours later using those offsets?
No. Due to Kafka’s retention settings or CDC tool behavior, those offsets may be unavailable or semantically different six hours later (e.g., reprocessed with schema changes). Perform reconciliation on Delta snapshots, not on raw Kafka streams.
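As an illustration, the reconciliation query can run against the Bronze Delta table using the offset metadata captured at ingestion time, instead of re-reading Kafka. The path and the offset range (1–100) below are placeholders matching the example in the question.

```python
# Count what was actually landed for the ingested offset range, per topic/partition.
snapshot_counts = spark.sql("""
    SELECT topic, partition,
           MIN(offset) AS min_offset,
           MAX(offset) AS max_offset,
           COUNT(*)    AS row_count
    FROM delta.`/mnt/bronze/accounts`          -- placeholder Bronze path
    WHERE offset BETWEEN 1 AND 100             -- the ingested offset range
    GROUP BY topic, partition
""")
snapshot_counts.show()
```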
Is Kafka a reliable source for recon between ingestion and target DB?
No. Kafka is not a system of record; it is transient and subject to retention, replay, and compaction issues. Use Delta Bronze or Silver as the source of truth. Optionally, leverage control logs or audit tables from InfoSphere CDC.
https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/delta-lake
How do we preserve an accurate snapshot for reconciliation if Kafka evolves?
As soon as the data lands in Bronze, assign it a run ID and store it in a partitioned folder (e.g., run_id=20240612_1000). Include the Kafka metadata (offset range, timestamp, topic, etc.). Track runs and store this metadata in a control table for reconciliation and auditability.
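A rough sketch of that pattern is below: it tags a run's records with a run ID, lands them as an immutable Bronze partition, and records the Kafka metadata in a control table. The control table name `ops.ingestion_runs`, the topic, and the paths are assumptions you would adapt; `batch_df` is assumed to hold one run's Kafka records (for example, inside a `foreachBatch` handler).

```python
from pyspark.sql import DataFrame, functions as F

def land_bronze_run(batch_df: DataFrame, run_id: str) -> None:
    # Tag every record with the run ID and land it as an immutable Bronze partition.
    tagged = batch_df.withColumn("run_id", F.lit(run_id))
    (
        tagged.write.format("delta")
            .mode("append")
            .partitionBy("run_id")
            .save("/mnt/bronze/accounts")                 # placeholder Bronze path
    )

    # Capture the run's Kafka metadata in the control table for recon/audit.
    run_meta = (
        tagged.agg(
            F.min("offset").alias("start_offset"),
            F.max("offset").alias("end_offset"),
            F.count(F.lit(1)).alias("row_count"),
        )
        .withColumn("run_id", F.lit(run_id))
        .withColumn("topic", F.lit("cdc.accounts"))        # placeholder topic
        .withColumn("ingested_at", F.current_timestamp())
    )
    run_meta.write.format("delta").mode("append").saveAsTable("ops.ingestion_runs")
```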
Can we map Kafka events to DB partitions using payload fields?
Yes, extract logical partition keys (e.g., client_id, year_month) from the Kafka payload and write them to Bronze/Silver. Don’t rely on Kafka’s physical partition numbers.
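For example, something along these lines can derive the logical keys from the payload during the Bronze-to-Silver step. The payload field names (`client_id`, `txn_date`, `amount`) are assumptions about your CDC schema, and the paths are placeholders.

```python
from pyspark.sql import functions as F, types as T

# Assumed shape of the CDC payload JSON.
payload_schema = T.StructType([
    T.StructField("client_id", T.StringType()),
    T.StructField("txn_date", T.StringType()),       # e.g., "2024-06-12"
    T.StructField("amount", T.DecimalType(18, 2)),
])

bronze_df = spark.read.format("delta").load("/mnt/bronze/accounts")   # placeholder path

silver_df = (
    bronze_df
        .withColumn("rec", F.from_json("payload", payload_schema))
        .select("rec.*", "offset", "timestamp")
        # Logical partition key derived from the business data, not Kafka partitions.
        .withColumn("year_month", F.date_format(F.to_date("txn_date"), "yyyyMM"))
)

(
    silver_df.write.format("delta")
        .mode("append")
        .partitionBy("client_id", "year_month")
        .save("/mnt/silver/accounts")                                  # placeholder path
)
```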
Should we keep the Bronze data after ingestion?
Yes, retain it until reconciliation with Hyperscale succeeds. If reconciliation fails or the Kafka data has been purged, Bronze acts as the rollback point.
To manage data retention effectively in your CDC pipeline, use a control table to track the recon_status of each ingestion run (e.g., PENDING, IN_PROGRESS, FAILED, SUCCESS). Only when the status is marked as 'SUCCESS', indicating that reconciliation with the target system is complete, should you automatically delete the corresponding Delta Bronze data. This ensures traceability and provides a rollback point in case of reconciliation failures.
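A minimal sketch of such a cleanup job, assuming the control table `ops.ingestion_runs` and the `run_id` partitioning from the earlier sketches:

```python
# Find runs whose reconciliation has been confirmed as SUCCESS.
deletable_runs = [
    r["run_id"]
    for r in spark.sql(
        "SELECT run_id FROM ops.ingestion_runs WHERE recon_status = 'SUCCESS'"
    ).collect()
]

for run_id in deletable_runs:
    # Remove only the reconciled run's Bronze partition; Delta records the delete
    # in its transaction log, and VACUUM later removes the underlying files.
    spark.sql(
        f"DELETE FROM delta.`/mnt/bronze/accounts` WHERE run_id = '{run_id}'"  # placeholder path
    )
```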
Should recon be done after the MERGE INTO from Silver to SQL completes?
Yes, final reconciliation should validate the data after the merge into the target completes, though intermediate validation (Bronze → Silver) is also valuable.
https://learn.microsoft.com/en-us/azure/databricks/delta/merge
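As an illustration of that final check, the sketch below compares per-partition row counts between Silver and the target database over JDBC. The JDBC URL, credentials, and target table/columns are placeholders, not your actual configuration.

```python
from pyspark.sql import functions as F

silver_counts = (
    spark.read.format("delta").load("/mnt/silver/accounts")            # placeholder path
        .groupBy("client_id", "year_month")
        .agg(F.count(F.lit(1)).alias("silver_rows"))
)

target_counts = (
    spark.read.format("jdbc")
        .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=<db>")  # placeholder
        .option("dbtable", "(SELECT client_id, year_month, COUNT(*) AS target_rows "
                           "FROM dbo.accounts GROUP BY client_id, year_month) q")        # placeholder
        .option("user", "<user>").option("password", "<password>")                       # placeholders
        .load()
)

# Any partition whose counts differ (or exists on only one side) fails the recon.
mismatches = (
    silver_counts.join(target_counts, ["client_id", "year_month"], "full_outer")
        .where(
            F.coalesce(F.col("silver_rows"), F.lit(0))
            != F.coalesce(F.col("target_rows"), F.lit(0))
        )
)

print("Reconciliation passed" if mismatches.count() == 0 else "Reconciliation FAILED")
```

The result of this check is what you would write back to the control table's recon_status, which then drives the Bronze retention cleanup described above.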
I hope this information helps. Please do let us know if you have any further queries.
Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.
Thank you.