CDC pipeline schema handling

Janice Chi 100 Reputation points
2025-06-12T13:24:09.4333333+00:00

We completed historical migration from DB2 to Azure SQL Hyperscale and ADLS Gen2 (Delta format, partitioned). Now building Catch-Up CDC pipelines using Kafka (via IBM CDC), ADF (orchestration), and Databricks (Delta processing). CDC data is merged with historical data at partition level and written to Hyperscale using staged MERGE logic. Goal is dynamic schema handling and efficient processing per Kafka topic.


❓Refined Questions for Microsoft

Schema Handling Across Kafka Topics (800 total): Each Kafka topic maps to one table and has its own schema.
– What’s the best practice to pass and validate the Kafka topic schema at runtime in Databricks when processing messages dynamically?
– Can we define the schema in a metadata/control table and deserialize the JSON dynamically per topic? If yes, how?

Flattening CDC Messages (before/after blocks): Kafka messages contain before and after blocks per event (insert/update/delete).
– Will the before and after blocks contain all columns for the primary key, or only the changed columns in the case of an update?
– What’s the recommended Spark approach to flatten these fields and ensure schema consistency across CDC batches?

Azure Databricks

1 answer

  1. J N S S Kasyap 3,300 Reputation points Microsoft External Staff Moderator
    2025-06-12T14:22:51.5733333+00:00

    Hi @Janice Chi
    You're navigating a complex CDC pipeline involving Kafka, ADF, and Databricks, and you've made good progress so far. Here's how to handle dynamic schemas and CDC message structures effectively:
    1. Dynamic Schema Handling:

    • Use a control table to store each Kafka topic's schema (as JSON or DDL). 
    • In Databricks, retrieve the schema by topic name and apply it dynamically with from_json(). This supports schema evolution and avoids hardcoded schemas; see the sketch below.
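
    As an illustration, here is a minimal PySpark sketch of that pattern. The control table name (ctl.topic_schemas), its columns (topic_name, schema_ddl), the topic name, and the broker placeholder are assumptions made for this example, not known parts of your environment:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json

spark = SparkSession.builder.getOrCreate()

# Hypothetical control table: one row per Kafka topic, schema stored as a DDL string,
# e.g. "before STRUCT<id BIGINT, name STRING>, after STRUCT<id BIGINT, name STRING>, op STRING"
topic = "customer_topic"  # illustrative topic name
schema_ddl = (
    spark.table("ctl.topic_schemas")            # assumed control table
         .filter(col("topic_name") == topic)
         .select("schema_ddl")
         .first()["schema_ddl"]
)

# Read the raw Kafka stream and deserialize the JSON value with the per-topic schema
raw_df = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "<broker-list>")  # placeholder
         .option("subscribe", topic)
         .load()
)

parsed_df = raw_df.select(
    from_json(col("value").cast("string"), schema_ddl).alias("payload")
)
```

    Storing the schema as a DDL string keeps the control table human-readable and lets from_json consume it directly, without rebuilding a StructType in code.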

    2. Flattening CDC Messages (before/after blocks):

    • CDC messages from tools like IBM InfoSphere CDC typically follow a { before, after, op } structure. 
    • In update events, the after block may include only the changed columns rather than the entire row, so verify that primary keys are always present in every image. 
    • Use Spark transformations such as selectExpr, withColumn, and col("after.field").alias("field") to flatten the structure and normalize the schema across all events, as in the sketch below.
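
    Continuing the example above, one hedged way to flatten the payload is to project every business column from the after image and fall back to the before image for deletes. The op code value ("D") and the assumption that before/after are nested structs in the parsed payload are illustrative and depend on how your IBM CDC feed is configured:

```python
from pyspark.sql.functions import col, when

# Assumes parsed_df from the previous sketch, where "payload" holds before, after and op
cdc_df = parsed_df.select("payload.*")

# Business columns are taken from the nested "after" struct of the deserialized schema
after_fields = cdc_df.schema["after"].dataType.fieldNames()

# For delete events keep the "before" image (the "after" block is typically null);
# for inserts and updates keep the "after" image.
flat_cols = [
    when(col("op") == "D", col(f"before.{c}")).otherwise(col(f"after.{c}")).alias(c)
    for c in after_fields
]

flat_df = cdc_df.select(*flat_cols, col("op"))
```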

    This setup ensures schema-aligned CDC processing, which is critical for safe MERGE INTO operations in Delta Lake or Azure SQL. 
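
    If the merge target is a Delta table, one way to apply the flattened events is a per-micro-batch MERGE via foreachBatch, sketched below. The target table name (hist.customer), key column (customer_id), op codes, and checkpoint path are assumptions for illustration; when writing to Azure SQL Hyperscale you would instead stage each batch and run the MERGE there:

```python
from delta.tables import DeltaTable

def upsert_cdc_batch(batch_df, batch_id):
    """Apply one CDC micro-batch to the target Delta table with a staged MERGE."""
    target = DeltaTable.forName(spark, "hist.customer")  # hypothetical target table
    (
        target.alias("t")
              .merge(batch_df.alias("s"), "t.customer_id = s.customer_id")  # hypothetical key
              .whenMatchedDelete(condition="s.op = 'D'")
              .whenMatchedUpdateAll(condition="s.op <> 'D'")
              .whenNotMatchedInsertAll(condition="s.op <> 'D'")
              .execute()
    )

# Run the MERGE once per micro-batch of the streaming CDC feed
query = (
    flat_df.writeStream
           .foreachBatch(upsert_cdc_batch)
           .option("checkpointLocation", "/checkpoints/customer_topic")  # placeholder path
           .start()
)
```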

    I hope this information helps. Please do let us know if you have any further queries.

    Kindly consider upvoting the comment if the information provided is helpful. This can assist other community members in resolving similar issues.

    Thank you.

