Azure Databricks

0 answers

Retry and Failure Handling Strategy for CDC Merge Pipeline from Kafka to Databricks and Hyperscale

In our CDC ingestion architecture, we are processing incremental changes (3,000-30,000 events/sec) across 800 topics for 800 tables from IBM DB2 using Kafka topics (via IBM InfoSphere CDC), with the following two stages: Kafka to Databricks Silver Layer: We…

asked

Janice Chi 100

edited the question

Janice Chi 100

1 answer

When to Use MERGE INTO vs APPLY CHANGES INTO in Databricks CDC Pipelines

Background: In our CDC pipeline, we use Databricks to process Kafka CDC data (I/U/D events) into Delta tables. We’re evaluating whether to continue using MERGE INTO or shift to APPLY CHANGES INTO. ❓ Questions for Microsoft: When should we prefer APPLY…

asked

Janice Chi 100

commented

Marcin Policht 49,005 MVP Volunteer Moderator

1 answer

Kafka Partitioning vs DB Partitions

We are working on a large-scale CDC ingestion pipeline after completing a one-time historical migration, in which we have already imported 80 TB of data via ADF to the bronze layer, where: Source: IBM DB2 (on-prem) CDC Tool: IBM InfoSphere CDC publishes to…

asked

Janice Chi 100

commented

Janice Chi 100

1 answer

Guidance on Connecting Azure Databricks to External Kafka Cluster (GCP-Hosted) for Structured Streaming Ingestion

We are implementing a real-time ingestion pipeline where Azure Databricks (in our tenant) consumes CDC data directly from a Kafka cluster hosted on GCP (external to Azure). The Kafka topics are populated by IBM InfoSphere CDC and are available in…

asked

Janice Chi 100

commented

Shraddha Pore 445 Microsoft External Staff Moderator

0 answers

Hash calculation strategy for datatypes mismatch

In our current project, we are migrating data from an on-premises IBM DB2 system to Azure SQL Hyperscale, using Azure Databricks for transformation and reconciliation. This includes both batch and CDC-based pipelines. Our project requirement is not just…

asked

Janice Chi 100

0 answers

Essential Data Cleaning steps to Succeed Recon

In our project, we are migrating 80TB of production-grade data from DB2 to Azure SQL Hyperscale using ADF and Databricks. While the primary transformation is data type conversion, what essential data cleaning steps should be performed to ensure…

asked

Janice Chi 100

commented

Janice Chi 100

0 answers

Handling DATETIME2 compatibility issue in Databricks during Hyperscale type alignment

In our project, we are transforming data in Azure Databricks coming from source systems (DB2 via CDC or snapshots) and storing it temporarily in Delta Lake. We later load this data into Azure SQL Hyperscale. To align with Hyperscale’s expected schema, we…

asked

Janice Chi 100

commented

Janice Chi 100

2 answers

Unable to create or bring up the cluster on Azure Databricks - Failed to perform resource identity operation

Hi, We have set up an Azure Databricks service along with supporting services, and it was working fine until the changes below were performed on the Azure subscription. Details of changes: the subscription was moved to a different directory. Post this change,…

asked

Sandeep Jidagi 0

commented

Shraddha Pore 445 Microsoft External Staff Moderator

1 answer

Impact of Kafka Partition Size on Databricks Streaming Performance When Writing to Azure SQL Hyperscale

In our project, we are using Databricks (not ADF) for both catch-up and real-time CDC ingestion from Kafka topics and writing the output directly to Azure SQL Hyperscale via JDBC. Some of our source Kafka topics (originating from DB2 CDC) may have large…

asked

Janice Chi 100

edited a comment

Chandra Boorla 13,790 Microsoft External Staff Moderator

0 answers

Best Practice for Schema Enforcement and Type Casting in Databricks Ingestion Pipeline

In our data pipeline, I'm considering importing source data into Databricks in raw string format (regardless of original data types), performing column-level validation, and only then applying explicit type casting to desired data types before writing to…

asked

Janice Chi 100

commented

J N S S Kasyap 3,300 Microsoft External Staff Moderator

1 answer

Integrity checks via Databricks

In our pipeline, the source data is coming from DB2 production systems and is assumed to be highly reliable. During transformation in Databricks, we are already performing column-level checks (e.g., presence of mandatory fields, null validation,…

asked

Janice Chi 100

answered

Kyle Burns 246 Microsoft Employee

3 answers

I can't get Databricks to talk to my storage account. Error 403

I can't get Databricks to mount my data lake storage. I get error 403 no matter what I do.

asked

Dev, Roger (RIS-HBE) 0

commented

Pritam Kabiraj 235 Microsoft External Staff Moderator

0 answers

CDC pipeline schema handling

We completed historical migration from DB2 to Azure SQL Hyperscale and ADLS Gen2 (Delta format, partitioned). Now building Catch-Up CDC pipelines using Kafka (via IBM CDC), ADF (orchestration), and Databricks (Delta processing). CDC data is merged with…

asked

Janice Chi 100

commented

Janice Chi 100

0 answers

CDC KAFKA recon strategy

We are implementing a CDC-based ingestion using IBM InfoSphere CDC pushing data to Kafka. Downstream, the data is consumed in Databricks, processed, and written to Azure SQL Hyperscale. We use run-wise ingestion into the Bronze layer and perform a…

asked

Janice Chi 100

commented

J N S S Kasyap 3,300 Microsoft External Staff Moderator

1 answer

I am not able to create a cluster. I am trying to create a single node cluster. However, when I try to select a node type, all seem disabled. I am also getting a message that "cluster cannot be created because no node is enabled for this subscription".

I am trying to create a single node cluster using my free trial account. I selected West India as my region while creating the resource group. When I try to create an "all purpose cluster" and then try to select a node type, all VM…

asked

bikash hota 0

edited an answer

Krupal Bandari 660 Microsoft External Staff Moderator

1 answer

Guidance on Designing Control Tables Across Historical Migration, Catch-up, and Streaming Phases

In our project, we are handling data ingestion in three phases: One-time historical migration (via ADF) Catch-up CDC (Kafka from IBM CDC) Real-time streaming (Structured Streaming with Databricks) We have already designed separate control tables for…

asked

Janice Chi 100

edited a comment

Chandra Boorla 13,790 Microsoft External Staff Moderator

1 answer

How to proactively avoid micro-batch data loss or duplication during Structured Streaming in high-volume Kafka-to-Azure SQL pipeline?

We are currently implementing a near real-time streaming architecture as part of a modernization project. In our streaming phase, we are consuming data from Kafka topics (one per table, approx. 800 total) using Databricks Structured Streaming and…

asked

Janice Chi 100

commented

J N S S Kasyap 3,300 Microsoft External Staff Moderator

1 answer

EIGHT HUNDRED KAFKA TOPICS PROCESSING BY DBR

We are working on a large-scale Change Data Capture (CDC) implementation where: The source system is IBM DB2. IBM InfoSphere CDC pushes changes to Kafka, with each Kafka topic representing one DB2 table. There are 800 Kafka topics in total, please note…

asked

Janice Chi 100

answered

Smaran Thoomu 24,015 Microsoft External Staff Moderator

2 answers

Azure pricing calculator

Why am I not able to use the pricing calculator after logging in to my account?

asked

Janice Chi 100

commented

Shraddha Pore 445 Microsoft External Staff Moderator

2 answers

Unable to launch single node Databricks cluster in Free trial subscription

Hello, I am unable to create/launch a single node cluster in Azure Databricks. I believe that in a free trial subscription one can try out Databricks by creating a single node (4 vCPU core). I have even tried creating the Databricks service in…

asked

Arnab_Azure_Learner 21

commented

bikash hota 0

2,481 questions with Azure Databricks tags
