Hi @Janice Chi
If a large table is pre-partitioned in Bronze (e.g., 10 physical partitions), and this table's changes are captured via a single Kafka topic with 10 partitions: does the number of Kafka partitions always align with the number of DB partitions or Delta table partitions? Can we rely on this alignment, or is this mapping independent and arbitrary from the Kafka producer side?
The number of Kafka partitions does not inherently align with the number of database (e.g., IBM DB2) or Delta table partitions (e.g., in the Bronze layer). Kafka partitions are physical divisions managed by the producer’s partitioning strategy (e.g., key-based, round-robin, or custom), while Delta table partitions are logical, typically based on a column like claim_month. Similarly, DB2 partitions are defined by the database’s partitioning scheme, which may not match the Kafka producer’s logic.
In your case, where IBM InfoSphere CDC publishes changes to Kafka (1 topic = 1 table), the assignment of messages to Kafka partitions depends on the CDC tool’s partitioning strategy. Unless explicitly configured to use a partitioning key that mirrors the Delta table’s or DB2’s partitioning scheme (e.g., claim_month), the mapping between Kafka partitions and DB/Delta partitions is arbitrary and independent.
You cannot rely on an automatic alignment between Kafka partitions and DB/Delta table partitions. To achieve alignment, you must:
- Confirm or configure the Kafka producer (InfoSphere CDC) to partition messages using a key that matches the Delta table’s partitioning column (e.g., claim_month).
- Validate this by inspecting Kafka partition data to ensure messages consistently map to corresponding Delta partitions.
Without such configuration, treat Kafka partitions as independent of DB/Delta partitions and design your pipeline to handle this mismatch, such as by extracting the partitioning column from message payloads during processing.
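For the validation step, a quick check like the following can show whether each Kafka partition consistently carries a single claim_month. This is a minimal sketch, assuming JSON payloads that include claim_month and the illustrative broker address, topic name (Topic1), and schema used in the examples further below:

```python
from pyspark.sql.functions import col, from_json, countDistinct

# Batch-read a sample of the topic to inspect how claim_month values
# are distributed across Kafka partitions (broker, topic, and schema
# are illustrative assumptions)
sample_df = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:port") \
    .option("subscribe", "Topic1") \
    .option("startingOffsets", "earliest") \
    .load()

schema = "claim_id STRING, claim_month STRING"
mapping_df = sample_df \
    .selectExpr("partition", "CAST(value AS STRING) AS json_value") \
    .select(col("partition"), from_json(col("json_value"), schema).alias("data")) \
    .groupBy("partition") \
    .agg(countDistinct(col("data.claim_month")).alias("distinct_claim_months"))

# If any Kafka partition carries more than one claim_month, there is no
# reliable alignment with the Delta partitioning scheme
mapping_df.show()
```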
How can we efficiently identify and target the relevant Delta table partitions (e.g., claim_month) during a MERGE INTO operation, given that Kafka partitions and offsets (e.g., Topic 1, Partition 1, Offsets 1–100) are physical and do not directly map to the logical partitioning used in Delta Lake?
To correlate Kafka message offsets (e.g., Topic 1, Partition 1, Offsets 1–100) with Delta table partitions (e.g., partitioned by claim_month) in the Silver layer for efficient MERGE INTO operations, you need to map Kafka messages to Delta partitions without scanning the entire table. Here is a concise, best-practice approach:
Best Practices: Correlate Kafka Offsets to Delta Partitions
1. Include the Partition Key in Kafka Messages: Ensure Kafka messages contain the Delta table's partition column (e.g., claim_month) in the payload or key. For example, a JSON message might include {"claim_id": "123", "claim_month": "2025-01", ...}. This allows mapping messages to Delta partitions like claim_month=2025-01.
2. Extract the Partition Key in Spark: Use Spark to read the Kafka messages for the offset range and extract claim_month. Because the offset range is bounded (e.g., offsets 1–100), a batch read with startingOffsets/endingOffsets is used here; a continuous streaming read would use readStream and omit endingOffsets (see step 6).

```python
from pyspark.sql.functions import col, from_json

# Batch-read the bounded offset range from Kafka (broker, topic, and offsets
# shown here follow the example scenario)
kafka_df = spark.read \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:port") \
    .option("subscribe", "Topic1") \
    .option("startingOffsets", """{"Topic1":{"1":1}}""") \
    .option("endingOffsets", """{"Topic1":{"1":100}}""") \
    .load()

# Parse the JSON payload and keep the Kafka metadata columns for traceability
schema = "claim_id STRING, claim_month STRING"  # add the remaining data fields
parsed_df = kafka_df \
    .selectExpr("CAST(value AS STRING) AS json_value", "offset", "partition", "topic") \
    .select(
        from_json(col("json_value"), schema).alias("data"),
        col("offset").alias("kafka_offset"),
        col("partition").alias("kafka_partition"),
        col("topic").alias("kafka_topic")
    ) \
    .select("data.*", "kafka_offset", "kafka_partition", "kafka_topic")
```

3. Group by Partition Key: Identify the distinct claim_month values to determine which Delta partitions are affected.

```python
# Collect the distinct partition-key values present in this batch of changes
partition_values = [row["claim_month"]
                    for row in parsed_df.select("claim_month").distinct().collect()]
```

4. Optimize MERGE INTO with Partition Filtering: Perform a MERGE INTO for each claim_month to target specific Delta partitions and avoid full table scans.
```python
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/path/to/silver/tableX")

# Merge one Delta partition at a time so the partition predicate prunes
# the target table to a single claim_month
for partition_value in partition_values:
    partition_df = parsed_df.filter(col("claim_month") == partition_value)
    delta_table.alias("target") \
        .merge(
            partition_df.alias("source"),
            f"target.claim_month = '{partition_value}' AND target.claim_id = source.claim_id"
        ) \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()
```
5. Track Kafka Metadata: Store kafka_offset, kafka_partition, and kafka_topic in the Delta table for traceability. Example schema:

| Column | Type | Description |
| --- | --- | --- |
| claim_id | String | Unique row identifier |
| claim_month | String | Partition key (e.g., 2025-01) |
| ... | ... | Other data fields |
| kafka_offset | Long | Kafka message offset |
| kafka_partition | Int | Kafka partition number |
| kafka_topic | String | Kafka topic name |
6. Use Checkpointing: For streaming reads, use Spark checkpointing to track processed offsets.
```python
# Applies when parsed_df is built with spark.readStream (omit endingOffsets);
# the checkpoint then tracks which Kafka offsets have already been processed
query = parsed_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .trigger(availableNow=True) \
    .start("/path/to/silver/tableX")
```
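If the pipeline runs as a continuous stream rather than a bounded batch, the per-partition MERGE from step 4 can be wrapped in foreachBatch so that checkpointing and the merge work together. This is a minimal sketch, assuming a streaming variant of the parsed DataFrame (parsed_stream_df, built with spark.readStream and no endingOffsets) and the same table path and claim_month/claim_id columns as above:

```python
from delta.tables import DeltaTable
from pyspark.sql.functions import col

def merge_microbatch(batch_df, batch_id):
    # foreachBatch hands each micro-batch over as a normal (non-streaming)
    # DataFrame, so the same partition-targeted MERGE pattern applies
    delta_table = DeltaTable.forPath(spark, "/path/to/silver/tableX")
    months = [r["claim_month"] for r in batch_df.select("claim_month").distinct().collect()]
    for month in months:
        delta_table.alias("target") \
            .merge(
                batch_df.filter(col("claim_month") == month).alias("source"),
                f"target.claim_month = '{month}' AND target.claim_id = source.claim_id"
            ) \
            .whenMatchedUpdateAll() \
            .whenNotMatchedInsertAll() \
            .execute()

# parsed_stream_df: the streaming equivalent of parsed_df (assumption)
query = parsed_stream_df.writeStream \
    .foreachBatch(merge_microbatch) \
    .option("checkpointLocation", "/path/to/checkpoint") \
    .trigger(availableNow=True) \
    .start()
```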
To enhance query performance on the Delta table, apply Z-Order indexing on the claim_id column. (claim_month is the partition column; Delta Lake does not allow Z-Ordering on partition columns, and partition pruning already covers it.) This can be done using delta_table.optimize().executeZOrderBy("claim_id") in Spark. Z-Order indexing improves data skipping and query efficiency by clustering related data, making MERGE INTO operations faster, especially for large tables.
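As a sketch (assuming the Delta Lake Python OPTIMIZE API available in delta-spark 2.0+ and the same table path and partition_values as above), the optimization can also be scoped to only the partitions touched by the merge:

```python
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/path/to/silver/tableX")

# Z-order only the Delta partitions that received changes in this run,
# instead of rewriting the whole table
for month in partition_values:
    delta_table.optimize() \
        .where(f"claim_month = '{month}'") \
        .executeZOrderBy("claim_id")
```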
Maintaining appropriate partition granularity, such as monthly partitions (e.g., claim_month=2025-01), is also critical to balance performance against metadata overhead; overly granular partitions (e.g., daily) lead to excessive metadata management and slow down operations. Together, these strategies (aligned partitioning, Z-Order indexing, and appropriate partition granularity) ensure efficient MERGE INTO operations by targeting specific Delta partitions, leveraging claim_month to prune irrelevant data, and preserving Kafka metadata for traceability.
I hope this information is helpful.