When to Use MERGE INTO vs APPLY CHANGES INTO in Databricks CDC Pipelines

Janice Chi
2025-06-14T12:13:23.58+00:00

Background: In our CDC pipeline, we use Databricks to process Kafka CDC data (I/U/D events) into Delta tables. We’re evaluating whether to continue using MERGE INTO or shift to APPLY CHANGES INTO.


❓ Questions for Microsoft:

When should we prefer APPLY CHANGES INTO over MERGE INTO for CDC processing?

What are the key limitations of MERGE INTO that APPLY CHANGES INTO addresses (e.g., performance, concurrency, native CDC support)?

What are the main differences in capability and behavior between the two (streaming support, schema evolution, error handling, deduplication)?

Can APPLY CHANGES INTO fully replace MERGE INTO in structured streaming pipelines using CDC I/U/D logic?

Are there cost or performance advantages in large-scale Delta tables when using APPLY CHANGES INTO?

1 answer

Marcin Policht
    2025-06-14T12:45:21.86+00:00

    When should we prefer APPLY CHANGES INTO over MERGE INTO for CDC processing?

    Use APPLY CHANGES INTO when:

    • You're working with structured streaming CDC data and need declarative semantics for I/U/D (Insert/Update/Delete).
    • You want to reduce boilerplate code around deduplication, ordering, and CDC event interpretation.
    • You want better performance and scalability on large tables with native change data capture support.
    • You want native support for slowly changing dimension patterns (SCD Type 1 and Type 2).

    In general, APPLY CHANGES INTO is purpose-built for CDC pipelines and runs inside a Delta Live Tables pipeline, while MERGE INTO is a general-purpose Delta Lake command you can run from any notebook or job.
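
    For reference, here is a minimal Python sketch of the declarative pattern inside a Delta Live Tables pipeline. The table, view, and column names (customers, cdc_events, id, operation, event_ts) are illustrative assumptions, not names from your pipeline.

    ```python
    # Minimal sketch (assumed names) of the Python equivalent of APPLY CHANGES INTO,
    # intended to run inside a Delta Live Tables pipeline.
    import dlt
    from pyspark.sql.functions import expr

    # Target streaming table that the CDC feed keeps up to date.
    dlt.create_streaming_table("customers")

    dlt.apply_changes(
        target="customers",                        # Delta table maintained by DLT
        source="cdc_events",                       # streaming view/table of Kafka CDC rows
        keys=["id"],                               # primary key used for matching
        sequence_by="event_ts",                    # ordering column for out-of-order events
        apply_as_deletes=expr("operation = 'D'"),  # treat these rows as deletes
        except_column_list=["operation", "event_ts"],  # drop CDC metadata from the target
        stored_as_scd_type=1,                      # keep only the latest version per key
    )
    ```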

    What key limitations of MERGE INTO does APPLY CHANGES INTO address?

    Limitation of MERGE INTO | Addressed by APPLY CHANGES INTO
    Not natively optimized for CDC data (I/U/D semantics must be coded manually) | Natively understands and handles the CDC input schema
    Manual deduplication and ordering logic required | Automatically deduplicates based on the sequence column and primary key
    Slower performance on large tables with high-frequency updates | Optimized for streaming ingestion with a better write path
    Lacks built-in semantics for deletes and SCD patterns | Declaratively supports insert, update, and delete logic
    Complexity increases with schema evolution | Supports basic schema evolution out of the box
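
    To make the deduplication/ordering row concrete: with MERGE INTO in a streaming job, you typically write something like the sketch below yourself for every micro-batch. The paths and column names (bronze_cdc, /delta/customers, id, operation, event_ts) are assumptions for illustration, and it presumes a Databricks notebook where spark is already defined.

    ```python
    # Manual dedup + merge per micro-batch: the boilerplate APPLY CHANGES INTO removes.
    from delta.tables import DeltaTable
    from pyspark.sql import functions as F, Window

    def upsert_to_delta(batch_df, batch_id):
        # Keep only the latest CDC event per key within this micro-batch.
        w = Window.partitionBy("id").orderBy(F.col("event_ts").desc())
        latest = (batch_df
                  .withColumn("rn", F.row_number().over(w))
                  .filter("rn = 1")
                  .drop("rn"))

        target = DeltaTable.forPath(spark, "/delta/customers")
        (target.alias("t")
            .merge(latest.alias("s"), "t.id = s.id")
            .whenMatchedDelete(condition="s.operation = 'D'")
            .whenMatchedUpdateAll(condition="s.operation <> 'D'")
            .whenNotMatchedInsertAll(condition="s.operation <> 'D'")
            .execute())

    (spark.readStream.table("bronze_cdc")   # Kafka CDC events already landed in bronze
        .writeStream
        .foreachBatch(upsert_to_delta)
        .option("checkpointLocation", "/checkpoints/customers")
        .start())
    ```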

    Main differences in capability and behavior

    Feature | MERGE INTO | APPLY CHANGES INTO
    Streaming support | Supported, but not optimized | Natively built for structured streaming
    CDC semantics | Manual | Declarative (insert, update, delete)
    Deduplication | Must implement manually | Built in, using the sequence column and keys
    Concurrency | Possible issues with concurrent writes | Better concurrency handling in streaming
    Schema evolution | Supported (but can be complex) | Supported with less overhead
    Error handling | Manual | Automatic error modes available
    Performance | Slower for high-churn workloads | More efficient for large-scale streaming updates
    Code complexity | Higher | Lower (fewer lines, less logic)

    Can APPLY CHANGES INTO fully replace MERGE INTO in structured streaming pipelines with CDC logic?

    In most cases, yes. If your data has a clear primary key, a reliable timestamp or sequence column, and your CDC source clearly defines the operation type (I/U/D), APPLY CHANGES INTO is a better fit and can fully replace MERGE INTO. However, for non-standard merge logic, complex joins, or custom SCD patterns beyond the built-in Type 1 and Type 2 handling, MERGE INTO may still be needed. MERGE INTO also remains more flexible for batch-based and one-off ad-hoc operations.
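
    As one example of logic that still calls for MERGE INTO, here is a hedged sketch of a soft-delete pattern (flagging rows rather than removing them), which APPLY CHANGES INTO does not express declaratively. The names (updates_df, /delta/customers, is_active, deleted_at) are illustrative assumptions.

    ```python
    # Soft delete + conditional upsert: custom merge logic kept in MERGE INTO.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/delta/customers")

    (target.alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")   # updates_df: deduplicated CDC batch
        # Soft delete: mark the row inactive instead of physically deleting it.
        .whenMatchedUpdate(
            condition="s.operation = 'D'",
            set={"is_active": "false", "deleted_at": "s.event_ts"})
        # Regular upsert path for inserts and updates.
        .whenMatchedUpdateAll(condition="s.operation <> 'D'")
        .whenNotMatchedInsertAll(condition="s.operation <> 'D'")
        .execute())
    ```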

    Are there cost or performance advantages for large-scale Delta tables when using APPLY CHANGES INTO?

    Yes. Major advantages include:

    • Lower compute cost: Efficiently processes CDC updates in micro-batches without repeatedly scanning large target tables.
    • Reduced I/O: Native optimizations reduce the number of files rewritten, especially important in large Delta tables.
    • Better scalability: Handles high-velocity streams with millions of events more effectively than MERGE INTO.
    • Optimized transaction overhead: Particularly important with Unity Catalog and concurrent writers.

    If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

    hth

    Marcin

