2,481 questions with Azure Databricks tags

0 answers

Retry and Failure Handling Strategy for CDC Merge Pipeline from Kafka to Databricks and Hyperscale

In our CDC ingestion architecture, we are processing incremental changes (3,000-30,000 events/sec) across 800 topics for 800 tables from IBM DB2 using Kafka topics (via IBM InfoSphere CDC), with the following two stages: Kafka to Databricks Silver Layer: We…

Azure Databricks
Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
2,481 questions
asked 2025-06-14T18:28:28.0433333+00:00
Janice Chi 100 Reputation points
edited the question 2025-06-14T18:31:44.98+00:00
Janice Chi 100 Reputation points
1 answer

When to Use MERGE INTO vs APPLY CHANGES INTO in Databricks CDC Pipelines

Background: In our CDC pipeline, we use Databricks to process Kafka CDC data (I/U/D events) into Delta tables. We’re evaluating whether to continue using MERGE INTO or shift to APPLY CHANGES INTO. ❓ Questions for Microsoft: When should we prefer APPLY…

asked 2025-06-14T12:13:23.58+00:00
Janice Chi 100 Reputation points
commented 2025-06-14T16:54:01.9366667+00:00
Marcin Policht 49,005 Reputation points MVP Volunteer Moderator
1 answer

Kafka Partitioning vs DB Partitions

We are working on a large-scale CDC ingestion pipeline after completing a one-time historical migration, in which we have already imported 80 TB of data via ADF to the Bronze layer, where: Source: IBM DB2 (on-prem) CDC Tool: IBM InfoSphere CDC publishes to…

asked 2025-06-11T12:44:08.54+00:00
Janice Chi 100 Reputation points
commented 2025-06-14T15:45:00.8166667+00:00
Janice Chi 100 Reputation points
1 answer

Guidance on Connecting Azure Databricks to External Kafka Cluster (GCP-Hosted) for Structured Streaming Ingestion

We are implementing a real-time ingestion pipeline where Azure Databricks (in our tenant) consumes CDC data directly from a Kafka cluster hosted on GCP (external to Azure). The Kafka topics are populated by IBM InfoSphere CDC and are available in…

asked 2025-06-10T15:58:44.08+00:00
Janice Chi 100 Reputation points
commented 2025-06-14T14:44:27.1733333+00:00
Shraddha Pore 445 Reputation points Microsoft External Staff Moderator
0 answers

Hash calculation strategy for datatypes mismatch

In our current project, we are migrating data from an on-premises IBM DB2 system to Azure SQL Hyperscale, using Azure Databricks for transformation and reconciliation. This includes both batch and CDC-based pipelines. Our project requirement is not just…

asked 2025-06-14T13:08:58.0566667+00:00
Janice Chi 100 Reputation points
0 answers

Essential Data Cleaning steps to Succeed Recon

In our project, we are migrating 80TB of production-grade data from DB2 to Azure SQL Hyperscale using ADF and Databricks. While the primary transformation is data type conversion, what essential data cleaning steps should be performed to ensure…

asked 2025-06-12T14:17:19.49+00:00
Janice Chi 100 Reputation points
commented 2025-06-14T13:02:25.5033333+00:00
Janice Chi 100 Reputation points
0 answers

Handling DATETIME2 compatibility issue in Databricks during Hyperscale type alignment

In our project, we are transforming data in Azure Databricks coming from source systems (DB2 via CDC or snapshots) and storing it temporarily in Delta Lake. We later load this data into Azure SQL Hyperscale. To align with Hyperscale’s expected schema, we…

asked 2025-06-13T13:19:48.26+00:00
Janice Chi 100 Reputation points
commented 2025-06-14T08:18:14.9366667+00:00
Janice Chi 100 Reputation points
2 answers

Unable to create or bring up the cluster on Azure Databricks - Failed to perform resource identity operation

Hi, We have set up an Azure Databricks service along with supporting services, and it was working fine until the changes below were made to the Azure subscription. Details of changes: the subscription was moved to a different directory. Post this change,…

asked 2025-06-06T07:12:48.59+00:00
Sandeep Jidagi 0 Reputation points
commented 2025-06-13T16:04:15.3+00:00
Shraddha Pore 445 Reputation points Microsoft External Staff Moderator
1 answer

Impact of Kafka Partition Size on Databricks Streaming Performance When Writing to Azure SQL Hyperscale

In our project, we are using Databricks (not ADF) for both catch-up and real-time CDC ingestion from Kafka topics and writing the output directly to Azure SQL Hyperscale via JDBC. Some of our source Kafka topics (originating from DB2 CDC) may have large…

asked 2025-06-10T16:08:12.71+00:00
Janice Chi 100 Reputation points
edited a comment 2025-06-13T16:01:29.57+00:00
Chandra Boorla 13,790 Reputation points Microsoft External Staff Moderator
0 answers

Best Practice for Schema Enforcement and Type Casting in Databricks Ingestion Pipeline

In our data pipeline, I'm considering importing source data into Databricks in raw string format (regardless of original data types), performing column-level validation, and only then applying explicit type casting to desired data types before writing to…

asked 2025-06-13T13:11:47.9+00:00
Janice Chi 100 Reputation points
commented 2025-06-13T14:18:10.9066667+00:00
J N S S Kasyap 3,300 Reputation points Microsoft External Staff Moderator
1 answer

Integrity checks via Databricks

In our pipeline, the source data is coming from DB2 production systems and is assumed to be highly reliable. During transformation in Databricks, we are already performing column-level checks (e.g., presence of mandatory fields, null validation,…

asked 2025-06-13T11:26:04.89+00:00
Janice Chi 100 Reputation points
answered 2025-06-13T12:38:27.3733333+00:00
Kyle Burns 246 Reputation points Microsoft Employee
3 answers

I can't get Databricks to talk to my storage account. Error 403

I can't get Databricks to mount my data lake storage. I get error 403 no matter what I do.

asked 2025-06-09T17:46:54.1166667+00:00
Dev, Roger (RIS-HBE) 0 Reputation points
commented 2025-06-13T10:35:54.21+00:00
Pritam Kabiraj 235 Reputation points Microsoft External Staff Moderator
0 answers

CDC pipeline schema handling

We completed historical migration from DB2 to Azure SQL Hyperscale and ADLS Gen2 (Delta format, partitioned). Now building Catch-Up CDC pipelines using Kafka (via IBM CDC), ADF (orchestration), and Databricks (Delta processing). CDC data is merged with…

asked 2025-06-12T13:24:09.4333333+00:00
Janice Chi 100 Reputation points
commented 2025-06-13T08:45:26.5433333+00:00
Janice Chi 100 Reputation points
0 answers

CDC KAFKA recon strategy

We are implementing a CDC-based ingestion using IBM InfoSphere CDC pushing data to Kafka. Downstream, the data is consumed in Databricks, processed, and written to Azure SQL Hyperscale. We use run-wise ingestion into the Bronze layer and perform a…

asked 2025-06-12T13:38:05.87+00:00
Janice Chi 100 Reputation points
commented 2025-06-13T08:10:03.4466667+00:00
J N S S Kasyap 3,300 Reputation points Microsoft External Staff Moderator
1 answer

I am not able to create a cluster. I am trying to create a single node cluster. However, when I try to select a node type, all seem disabled. I am also getting a message that "cluster cannot be created because no node is enabled for this subscription".

I am trying to create a single node cluster using my free trial account. I selected West India as my region while creating the resource group. When I am trying to create "all purpose cluster" and then I am trying to select node type, all VM…

asked 2025-06-11T04:42:29.6+00:00
bikash hota 0 Reputation points
edited an answer 2025-06-13T01:27:04.44+00:00
Krupal Bandari 660 Reputation points Microsoft External Staff Moderator
1 answer

Guidance on Designing Control Tables Across Historical Migration, Catch-up, and Streaming Phases

In our project, we are handling data ingestion in three phases: One-time historical migration (via ADF) Catch-up CDC (Kafka from IBM CDC) Real-time streaming (Structured Streaming with Databricks) We have already designed separate control tables for…

asked 2025-06-11T16:59:44.7+00:00
Janice Chi 100 Reputation points
edited a comment 2025-06-12T20:08:08.1866667+00:00
Chandra Boorla 13,790 Reputation points Microsoft External Staff Moderator
1 answer

How to proactively avoid micro-batch data loss or duplication during Structured Streaming in high-volume Kafka-to-Azure SQL pipeline?

We are currently implementing a near real-time streaming architecture as part of a modernization project. In our streaming phase, we are consuming data from Kafka topics (one per table, approx. 800 total) using Databricks Structured Streaming and…

asked 2025-06-10T08:04:08.79+00:00
Janice Chi 100 Reputation points
commented 2025-06-11T11:32:41.5633333+00:00
J N S S Kasyap 3,300 Reputation points Microsoft External Staff Moderator
1 answer

Eight Hundred Kafka Topics Processing by DBR

We are working on a large-scale Change Data Capture (CDC) implementation where: The source system is IBM DB2. IBM InfoSphere CDC pushes changes to Kafka, with each Kafka topic representing one DB2 table. There are 800 Kafka topics in total, please note…

asked 2025-06-11T07:06:44.0833333+00:00
Janice Chi 100 Reputation points
answered 2025-06-11T08:04:03.1533333+00:00
Smaran Thoomu 24,015 Reputation points Microsoft External Staff Moderator
2 answers

Azure pricing calculator

Why am I not able to use the pricing calculator after logging in to my account?

asked 2025-06-06T13:59:10.5233333+00:00
Janice Chi 100 Reputation points
commented 2025-06-11T06:02:46.1766667+00:00
Shraddha Pore 445 Reputation points Microsoft External Staff Moderator
2 answers One of the answers was accepted by the question author.

Unable to launch single node Databricks cluster in Free trial subscription

Hello, I am unable to create or launch a single node cluster in Azure Databricks. I believe that in a free trial subscription one can try out Databricks by creating a single node (4 vCPU cores). I have even tried out creating the Databricks service in…

asked 2022-04-19T17:48:57.94+00:00
Arnab_Azure_Learner 21 Reputation points
commented 2025-06-11T04:25:00.3733333+00:00
bikash hota 0 Reputation points