CDC pipeline schema handling

Janice Chi 100 Reputation points
2025-06-12T13:24:09.4333333+00:00

We completed historical migration from DB2 to Azure SQL Hyperscale and ADLS Gen2 (Delta format, partitioned). Now building Catch-Up CDC pipelines using Kafka (via IBM CDC), ADF (orchestration), and Databricks (Delta processing). CDC data is merged with historical data at partition level and written to Hyperscale using staged MERGE logic. Goal is dynamic schema handling and efficient processing per Kafka topic.


❓Refined Questions for Microsoft

Schema Handling Across Kafka Topics (800 total): Each Kafka topic maps to one table and has its own schema.
– What’s the best practice to pass and validate the Kafka topic schema at runtime in Databricks when processing messages dynamically?
– Can we define the schema in a metadata/control table and deserialize the JSON dynamically per topic? If yes, how?

Flattening CDC Messages (before/after blocks): Kafka messages contain before and after blocks per event (insert/update/delete).
– Will the before and after blocks contain all columns for the primary key, or only the changed columns in the case of an update?
– What’s the recommended Spark approach to flatten these fields and ensure schema consistency across CDC batches?

Azure Databricks

1 answer

  1. J N S S Kasyap 3,300 Reputation points Microsoft External Staff Moderator
    2025-06-12T14:22:51.5733333+00:00

    Hi @Janice Chi
    You're navigating a complex CDC pipeline involving Kafka, ADF, and Databricks, and you've made good progress so far. Here's how to handle dynamic schemas and CDC message structures effectively:
    1. Dynamic Schema Handling:

    • Use a control table to store each Kafka topic's schema (as JSON or DDL). 
    • In Databricks, retrieve the schema by topic name and apply it dynamically with from_json(). This supports schema evolution and avoids hardcoded schemas; see the sketch below.
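
    As an illustration, here is a minimal PySpark sketch of that pattern. The control table name (ctl.topic_schemas), its columns (topic_name, schema_ddl), the topic name, and the broker placeholder are assumptions made for this example, not known parts of your environment:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.getOrCreate()

# Hypothetical control table: one row per Kafka topic, schema stored as a DDL string,
# e.g. "before STRUCT<id BIGINT, name STRING>, after STRUCT<id BIGINT, name STRING>, op STRING"
topic = "customer_topic"  # illustrative topic name
schema_ddl = (
    spark.table("ctl.topic_schemas")            # assumed control table
         .filter(col("topic_name") == topic)
         .select("schema_ddl")
         .first()["schema_ddl"]
)

# Read the raw Kafka stream and deserialize the JSON value with the per-topic schema
raw_df = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "<broker-list>")  # placeholder
         .option("subscribe", topic)
         .load()
)

parsed_df = raw_df.select(
    from_json(col("value").cast("string"), schema_ddl).alias("payload")
)
```

    Storing the schema as a DDL string keeps the control table human-readable and lets from_json consume it directly, without rebuilding a StructType in code.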

    2. Flattening CDC Messages (before/after blocks):

    • CDC messages from tools like IBM InfoSphere CDC typically follow a { before, after, op } structure. 
    • In update events, the after block may include only the changed columns rather than the entire row, so verify that primary keys are always present in every image. 
    • Use Spark transformations such as selectExpr, withColumn, and col("after.field").alias("field") to flatten the structure and normalize the schema across all events, as in the sketch below.
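
    Continuing the example above, one hedged way to flatten the payload is to project every business column from the after image and fall back to the before image for deletes. The op code value ("D") and the assumption that before/after are nested structs in the parsed payload are illustrative and depend on how your IBM CDC feed is configured:

```python
from pyspark.sql.functions import col, when

# Assumes parsed_df from the previous sketch, where "payload" holds before, after and op
cdc_df = parsed_df.select("payload.*")

# Business columns are taken from the nested "after" struct of the deserialized schema
after_fields = cdc_df.schema["after"].dataType.fieldNames()

# For delete events keep the "before" image (the "after" block is typically null);
# for inserts and updates keep the "after" image.
flat_cols = [
    when(col("op") == "D", col(f"before.{c}")).otherwise(col(f"after.{c}")).alias(c)
    for c in after_fields
]

flat_df = cdc_df.select(*flat_cols, col("op"))
```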

    This setup ensures schema-aligned CDC processing, which is critical for safe MERGE INTO operations in Delta Lake or Azure SQL. 
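
    If the merge target is a Delta table, one way to apply the flattened events is a per-micro-batch MERGE via foreachBatch, sketched below. The target table name (hist.customer), key column (customer_id), op codes, and checkpoint path are assumptions for illustration; when writing to Azure SQL Hyperscale you would instead stage each batch and run the MERGE there:

```python
from delta.tables import DeltaTable

def upsert_cdc_batch(batch_df, batch_id):
    """Apply one CDC micro-batch to the target Delta table with a staged MERGE."""
    target = DeltaTable.forName(spark, "hist.customer")  # hypothetical target table
    (
        target.alias("t")
              .merge(batch_df.alias("s"), "t.customer_id = s.customer_id")  # hypothetical key
              .whenMatchedDelete(condition="s.op = 'D'")
              .whenMatchedUpdateAll(condition="s.op <> 'D'")
              .whenNotMatchedInsertAll(condition="s.op <> 'D'")
              .execute()
    )

# Run the MERGE once per micro-batch of the streaming CDC feed
query = (
    flat_df.writeStream
           .foreachBatch(upsert_cdc_batch)
           .option("checkpointLocation", "/checkpoints/customer_topic")  # placeholder path
           .start()
)
```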

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

